Abstract

When a perturbation is applied in a sensorimotor transformation task, subjects can adapt and maintain performance by 
either relying on sensory feedback, or, in the absence of such feedback, on information provided by rewards. For example, 
in a classical rotation task where movement endpoints must be rotated to reach a fixed target, human subjects can 
successfully adapt their reaching movements solely on the basis of binary rewards, although this proves much more difficult 
than with visual feedback. Here, we investigate such a reward-driven sensorimotor adaptation process in a minimal 
computational model of the task. The key assumption of the model is that synaptic plasticity is gated by the reward. We 
study how the learning dynamics depend on the target size, the movement variability, the rotation angle and the number 
of targets. We show that when the movement is perturbed for multiple targets, the adaptation process for the different 
targets can interfere destructively or constructively depending on the similarities between the sensory stimuli (the targets) 
and the overlap in their neuronal representations. Destructive interferences can result in a drastic slowdown of the 
adaptation. As a result of interference, the time to adapt varies non-linearly with the number of targets. Our analysis shows 
that these interferences are weaker if the reward varies smoothly with the subject's performance instead of being binary. 
We demonstrate how shaping the reward or shaping the task can accelerate the adaptation dramatically by reducing the 
destructive interferences. We argue that experimentally investigating the dynamics of reward-driven sensorimotor 
adaptation for more than one sensory stimulus can shed light on the underlying learning rules. 
Introduction

Transformations that map sensory inputs to motor commands 
are referred to as sensorimotor mappings [1]. While sensorimotor 
mappings are already formed at early stages of development [2],
they are subject to modifications, since the brain, the body and/or
the environment are constantly changing. Plasticity in sensorimotor
mappings has been extensively studied in situations where
subjects receive sensory feedback during the task, allowing them to
correct their motor actions and to adapt to the induced
perturbation. These include visuomotor rotation [3], reaching
movements under force fields [4], adaptation of smooth pursuit
eye movements [5], prism adaptation [6], and pitch perturbation
in songbirds [7] and in humans [8].

Although these studies involve different sensory modalities and
different effectors, they are similar in the sense that they all have
sensory goals (targets) and a motor gesture is made to reach the
target. They consist of three phases: a standard phase, in
which subjects perform the task under regular conditions; an
adaptation phase, where subjects perform the same task
under the perturbed condition; and a washout phase, during which
the perturbation is removed and the subject readapts toward
baseline. Remarkably, in all three phases, movements display
substantial trial-to-trial variability. Recent theoretical as well as
experimental studies suggest that this variability plays a crucial
role in sensorimotor learning and adaptation processes [9-11].

Another issue concerns the ability of subjects to generalize the
adaptation from one context to another. This
has been investigated by testing how subjects perform upon
presentation of sensory stimuli that were not present during the
adaptation phase [12,13]. Generalization is usually good for
sensory stimuli that are similar to the one used during adaptation
and degrades as the sensory stimuli become more different [3,14].
Remarkably, subjects can even perform worse than at baseline
(negative generalization) for sensory stimuli that are very
different from those presented to the subject during
adaptation. This has been observed, for instance, in motor
reaching tasks, when the tested stimulus is presented in a direction
opposite to the adapted direction [4,14].

The above-mentioned studies implicitly assumed that the neural
mechanisms for adaptation are driven by sensory feedback,
which supplies a continuous error signal to the subject. Yet, recent
studies show that adaptation is possible even without any sensory
feedback, when only a binary reward that signals the success or
failure of a trial is provided to the subject [15-17]. Moreover,
recent experimental work suggests that reward-based mechanisms
also affect the adaptation dynamics in sensorimotor tasks even
when sensory feedback is available [18,19].

Author Summary

The brain has a robust ability to adapt to external
perturbations imposed on acquired sensorimotor transformations.
Here, we used a mathematical model to
investigate the reward-based component in sensorimotor
adaptations. We show that the shape of the delivered
reward signal, which in experiments is usually binary to
indicate success or failure, affects the adaptation dynamics.
We demonstrate how the ability to adapt to perturbations
by relying solely on binary rewards depends on motor
variability, size of perturbation and the threshold for
delivering the reward. When adapting motor responses to
multiple sensory stimuli simultaneously, on-line interferences
between the motor performance in response to the
different stimuli occur as a result of the overlap in the
neural representation of the sensory stimuli, as well as
the physical distance between them. Adaptation may be
extremely slow when perturbations are applied to a few
stimuli that are physically different from each other
because of destructive interferences. When intermediate
stimuli are introduced, the physical distance between
neighboring stimuli is reduced, and constructive interferences
can emerge, resulting in faster adaptation. Remarkably,
adaptation to a widespread sensorimotor perturbation is
accelerated by increasing the number of sensory stimuli
during training, i.e. learning is faster if one learns more.

However, and not surprisingly, adaptation relying solely on
rewards at the end of a trial is more difficult than when sensory
feedback on the performance is provided continuously during the
task, as sensory feedback conveys more information
regarding errors. For instance, when visual feedback is available in
visuomotor rotation tasks, subjects adapt to large perturbations (e.g.
30 degrees) in a few dozen trials [3,20], while in the absence of
such feedback, but with binary (success or failure) reward
feedback, subjects find it notoriously difficult to adapt. Recent
studies, nevertheless, have shown that it is possible to adapt to
large perturbations relying solely on rewards if the size of the
perturbation is slowly increased between rewarded blocks of trials
[17,21]. The fact that progressively increasing the amount of
perturbation makes it possible to adapt, even when the perturbation
is large, is reminiscent of the classical shaping strategy [22]. In
shaping, the difficulty of the task is increased gradually in order to
accelerate learning, or even to make it possible. Although shaping
is routinely used in laboratories when training animals to perform
complex sensorimotor and cognitive tasks [23-25], it is only in
recent years that it has started to be explored in a theoretical
framework [26-28].

What neural mechanisms could be involved in this reward-based
learning? Recent experimental evidence [29-31] indicates
that rewards modulate local synaptic plasticity via global
neuromodulatory signals, e.g. dopamine. When combined with
the popular idea that synapses are modified according to Hebbian
rules, this leads to the hypothesis that reward signals interact with
local neuronal activity to modulate synaptic efficacies [32,33]. This
theoretical paper aims to provide qualitative as well as quantitative
insights into the conditions in which sensorimotor adaptation
relying solely on rewards can take place. More specifically, we
assume that a local learning rule, based on the coactivation of pre-
and postsynaptic neurons and gated by a binary reward signal, is the
neural basis for modifications of synaptic efficacies [32,34,35].



We focus here on adaptation to a rotation during reaching
movements where subjects are asked to move a cursor on a screen
to bring it within a circular target while the cursor trajectory is
rotated (perturbed) by some angle with respect to the hand
trajectory. These perturbation tasks are classically used in
behavioral studies of sensorimotor adaptation [3]. We consider a
simplified network model of this task where adaptation relies solely
on binary rewards [17]. The simplicity of the model allows us to
analytically study several aspects of the adaptation dynamics.
Combining these results with numerical simulations enables us to
investigate the ways in which the learning dynamics depend on the
model parameters. The key question is how the dynamics of
adaptation are affected when the task involves multiple targets.
Four main findings are reported: interferences can occur when
adapting to multiple stimuli; these interferences can slow down the
adaptation dynamics dramatically; this slowdown depends on the type
of reward (binary, stochastic); and the slowdown can be overcome by using
shaping strategies.

Results 

We consider the classical rotation experiment [3] in which a
subject has to move a cursor on a screen to bring it within a
circular target with a radius of $\sqrt{\epsilon}$; see Figure 1A. At the beginning
of the experiment there is no discrepancy between the movement
of the hand and the movement of the cursor. We assume that the
subject is able to generate the appropriate hand movement to
perform the task correctly. A perturbation is then introduced, so
that the cursor trajectory is rotated by an angle $\gamma$ with respect to
the hand trajectory. The subject has to adapt his movements to
this new condition.

In the present work, we focus on the case where the subject 
receives no visual feedback about the trajectory of the cursor. The 
only information on performance is a reward provided by the 
experimentalist at the end of a trial, according to the location of 
the cursor with respect to the desired target. 

Our simplified model for a network which generates the
reaching movement is depicted in Figure 1B. Its input layer
consists of sensory neurons tuned to the location of the target. It
has the geometry of a ring: the preferred direction (between 0° and
360°) of a neuron corresponds to its location on the ring (see
Eq. (2)). Hence, when a target appears, the population activity
profile in the input layer peaks around a location which is also the
target direction. For simplicity we assume that the tuning curves of
all the neurons have the same shape. Therefore, the shapes of the
population activity profile and the tuning curves are identical. In
particular, the tuning width, $\rho$, is also the width of the activity
profile.

The output layer consists of two linear units. Their activity
encodes the coordinates $(r_1, r_2) = \mathbf{r}$ of the endpoint of the hand
movement in the two-dimensional environment. The connectivity
matrix implementing the sensorimotor mapping between the input
and the output layer is denoted by $W$ (a $2 \times N$ matrix). In addition to their
feedforward inputs from the first layer, the output units also receive
Gaussian noise, $\xi \sim \mathcal{N}(0, \sigma^2 I)$ (see Eq. (4)), where $\sigma$ is the SD of
the noise (also referred to hereafter as the noise level). The vector
representing the endpoint of the cursor is obtained by rotating the
output vector of the second layer, $\mathbf{r}$, by an angle $\gamma$ ($2 \times 2$ rotation
matrix $D_\gamma$).

The reward, $R$, delivered at the end of the movement, depends
on the distance between the cursor and the target. Unless specified
otherwise it is binary: $R = 1$ for a successful trial, i.e. if the squared
distance is smaller than the target size, and $R = 0$ otherwise. The
target size is controlled by the parameter $\epsilon$ and therefore $\epsilon$ is
referred to as the target size in the text.

Figure 1. Schematic description of the sensorimotor adaptation task and the model. A. The rotation task. From left to right: 1) A circular
target (red circle) of radius $\sqrt{\epsilon}$ appears on the screen at direction $\theta$ (here $\theta = 0^\circ$) to instruct the subject where to move the cursor. 2) The subject
moves the cursor, which is invisible to him, toward the target (blue arrow). The only information available to the subject on his performance is the
reward, delivered only if the cursor falls within the target. 3) A perturbation is introduced: the cursor is rotated by an angle $\gamma$ with respect to the
direction of the subject's hand movement (black arrow). 4) A learning phase follows where the subject progressively adapts to the perturbation,
reducing the distance between the cursor endpoint and the target. B. Schematic description of the model. When the target appears, the activity
profile of the input layer (red neurons) peaks around the target direction. The parameter $\rho$ controls the width of the activity profile. The connectivity
matrix between the input and the output (blue neurons) layers is denoted by $W$. Gaussian noise with zero mean and a standard deviation of $\sigma$ is
added to the output layer of the network. The two-dimensional output vector rotated by the matrix $D_\gamma$ represents the cursor endpoint. A reward is
delivered if the distance between the cursor endpoint and the center of the target is smaller than $\sqrt{\epsilon}$. The connectivity matrix $W$ is then changed
according to a reward-modulated plasticity rule (see Eq. (8)).
doi:10.1371/journal.pcbi.1003377.g001

Following trial $t$, the network adapts to the rotation by
modifying the connectivity matrix, $W$, according to the reward-gated
synaptic plasticity rule [32,36-38]:

$$W(t) = W(t-1) + \eta\, R(t)\, \xi(t)\, F^T(\theta(t))$$

where $\eta$ is the learning rate, $\xi$ is the noise in the output layer and
$F(\theta)$ is the activity of the input layer in response to the
presentation of a target in direction $\theta$. We will assume that the
initial value of the connectivity matrix is such that without noise,
the network performs the task perfectly for all target directions
when $\gamma = 0^\circ$ (see Eq. (9)). More details about the model are given in
Materials and Methods.
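
As an illustration, a single trial of such a network can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used for the figures: the von Mises-like tuning curve, the number of input neurons `N`, the value of the tuning width `RHO`, the target placed at unit distance from the origin, and the construction of the initial weights are all assumptions made here, since the corresponding equations are given only in Materials and Methods. The last lines of `run_trial` apply the reward-gated rule: after a rewarded trial, the weights are incremented by the learning rate times the outer product of the output noise and the input activity.

```python
import math
import random

N = 100      # number of input neurons (assumed value)
RHO = 0.5    # tuning width in radians (assumed value)
PREFERRED = [2.0 * math.pi * i / N for i in range(N)]  # preferred directions on the ring


def input_activity(theta):
    """Population activity F(theta); a von Mises-like bump is assumed here."""
    return [math.exp((math.cos(theta - p) - 1.0) / RHO ** 2) for p in PREFERRED]


def initial_weights():
    """2 x N weights such that, without noise, W F(theta) is (cos theta, sin theta)."""
    f0 = input_activity(0.0)
    gain = sum(f * math.cos(p) for f, p in zip(f0, PREFERRED))
    return [[math.cos(p) / gain for p in PREFERRED],
            [math.sin(p) / gain for p in PREFERRED]]


def run_trial(W, theta, gamma, sigma, eps, eta, rng):
    """Simulate one trial; W is updated in place only if the trial is rewarded."""
    F = input_activity(theta)
    xi = [rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)]            # output noise
    r = [sum(w * f for w, f in zip(row, F)) + x for row, x in zip(W, xi)]
    c, s = math.cos(gamma), math.sin(gamma)
    cursor = (c * r[0] - s * r[1], s * r[0] + c * r[1])            # rotated endpoint
    error = (cursor[0] - math.cos(theta)) ** 2 + (cursor[1] - math.sin(theta)) ** 2
    reward = 1 if error < eps else 0                               # binary reward
    if reward:
        for k in range(2):
            for i in range(N):
                W[k][i] += eta * xi[k] * F[i]                      # eta * R * xi * F^T
    return error, reward
```

Calling `run_trial` repeatedly with the same `W` and a seeded `random.Random` instance gives one realization of a learning curve; the values of `eta`, `sigma` and `eps` must be chosen by the user.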

The simplicity of the model allows for analytical calculations in 
the limit of small targets and a better understanding of the learning 
dynamics. However, the results reported here are grounded on the 
assumption of a reward-modulated learning rule and are 
qualitatively independent of the simplifying assumptions used to 
construct the model. For instance, as shown in Figure S2, the 
results still hold qualitatively in a more complicated network 
architecture with a different decoding scheme. 

The learning dynamics for one target 

We first consider the case where the network has to adapt to a
rotation of the cursor when only one target is presented. Figure 2A
(left) plots the evolution of the error (see Eq. (5)) with the number of
trials, hereafter referred to as the learning curve, while the network
adapts to an imposed rotation with an angle $\gamma = 30^\circ$. In the right
panel we plot, for the same parameters, the learning curve of the
directional error, which takes into account only the direction of the
movement.

The error is large at the beginning of the process and decreases
with the number of trials. Importantly, the dynamics strongly
depend on the noise. For a low noise level (Figure 2A, $\sigma = 0.1$), the
error remains large for many trials and learning is slow. When the
noise level is higher (Figure 2B, $\sigma = 0.2$) the error declines faster.
However, this comes at the cost of increasing the error after
learning: the median of this error, called hereafter the final error (see
Materials and Methods), is larger when the noise level is larger.
Similarly, the probability that the network will perform the task
successfully improves more rapidly with the number of trials for
$\sigma = 0.2$ than for $\sigma = 0.1$, but at very long times it is larger in the
latter ($0.824 \pm 0.001$) than in the former ($0.443 \pm 0.004$) case.

The learning curves plotted in Figure 2A-B were obtained for
particular realizations of the noise. To provide a statistical
characterization of these dynamics, we estimated the distributions
of the logarithm of the learning duration over many
realizations of the noise (see Materials and Methods). As shown
in Figure 2D, this distribution shifts toward longer learning
durations as the noise level decreases.

Figures 2A and 2C plot the learning curves for $\epsilon = 0.05$ and
$\epsilon = 0.1$ for the same noise level. The learning is substantially faster
for $\epsilon = 0.1$ but the final error is larger in this case. This is because
when the target size is large, a reward might also be delivered for
less precise movements, i.e., for large errors. Figure 2E plots the log
learning duration and the final error averaged over 1,000
realizations vs. the target size: when increasing the target size,
the learning duration rapidly decreases, whereas the final error
increases.

Figure 2. Learning dynamics when the network adapts to the rotation for one target. A. An example of a learning curve for $\epsilon = 0.05$, $\sigma = 0.1$.
Left: the error is calculated as the squared distance between the cursor endpoint and the target (see Eq. (5)) and plotted as a function of the trial number.
The rotation perturbation is applied on trials following $t = 0$. For display purposes, only one in four trials is displayed. The solid line represents the error,
smoothed with a 100-trial sliding median window. Final error of $0.02 \pm 0.001$ (mean ± SE, computed as explained in Materials and Methods). Dashed
purple line: target size. Right: as in left, but only the directional part of the error is plotted against the trial number. The shaded area corresponds to the
target size. B. Same as in the left panel of A, but with $\epsilon = 0.05$, $\sigma = 0.2$ and a corresponding final error of $0.06 \pm 0.001$. C. Same as in the left panel of A, but
with $\epsilon = 0.1$, $\sigma = 0.1$ and a corresponding final error of $0.03 \pm 0.001$. D. Probability density function (p.d.f.) of the logarithm of the learning duration. The
learning duration ($t_l$) is defined as the number of trials it takes to learn the task (see Materials and Methods). Target size is $\epsilon = 0.05$. E. Trade-off between
learning duration and final error. The average of the $\log_{10} t_l$ distribution (green) and the final error (blue) are plotted against the target size. The shaded area
around the averages corresponds to half the SD of the distributions. Solid lines: $\sigma = 0.1$. Dashed lines: $\sigma = 0.2$. F. The probability of getting the first reward, $p_1$
(see Eq. (10)), vs. the noise level, $\sigma$, for two values of the target size. In all the panels: $\gamma = 30^\circ$.
doi:10.1371/journal.pcbi.1003377.g002

When the noise level or the target size is increased, the
dynamics are typically faster because the probability of generating
rewarded trials at the beginning of the learning is larger. As this
probability increases, the time for the network to generate a
rewarded trial decreases, leading to more updates in the
connectivity matrix $W$; hence the probability of the following
trials to be rewarded increases further. This argument can be
made more quantitative if one considers how the time to get the
first reward depends on $\sigma$ and $\epsilon$. It has a geometric distribution
with a parameter $p_1$ (see Eq. (10)), which is the probability to get
the first reward. Lower values of $p_1$ increase the expected time
to the first reward, and thereby the learning duration. When the
noise level is low and the initial error is larger than the target size,
the network explores a small region of the two-dimensional space
and the probability of getting a reward is small. In contrast, for
very large noise the target is missed most of the time. The
probability $p_1$ therefore varies non-monotonically with the noise
level (Figure 2F). The dependency on target size is simpler: $p_1$
increases monotonically with target size, as it is more likely to
reach a larger target.
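
The non-monotonic dependence on the noise level can be checked with a short Monte Carlo estimate. The sketch below is not the expression of Eq. (10); it assumes that the probability of the first reward is the probability that the noise alone brings the rotated cursor into the target before any weight update, and that the target lies at unit distance from the origin, so that the initial displacement of the noiseless cursor has length 2 sin(γ/2). The function name and its default arguments are hypothetical.

```python
import math
import random


def first_reward_probability(sigma, eps, gamma_deg=30.0, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the chance that a trial is rewarded before any learning."""
    rng = random.Random(seed)
    # Distance between the target and the rotated, noiseless cursor endpoint
    # (target assumed at unit distance from the origin).
    offset = 2.0 * math.sin(math.radians(gamma_deg) / 2.0)
    hits = 0
    for _ in range(n_samples):
        dx = offset + rng.gauss(0.0, sigma)
        dy = rng.gauss(0.0, sigma)
        if dx * dx + dy * dy < eps:  # squared distance below the target size
            hits += 1
    return hits / n_samples
```

Evaluating the function for a small, an intermediate and a very large noise level at fixed target size should give the smallest estimates at the two extremes. Since the waiting time to the first reward is geometric, its mean is the inverse of the estimated probability.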

Performance depends on the learning rate parameter.
Obviously, the number of trials required to adapt also
depends on $\eta$, which scales the increment in synaptic strength
following a rewarded trial. If the rate is too small, the adaptation
will be extremely long, even for large noise or big target size. On
the other hand, if this rate is too large, learning is likely to be
impossible.
Figure 3. Performance and noiseless performance after
learning depend on the learning rate. A. An example of the
variations of the error (blue) and the noiseless error (red) with the
number of trials for $\epsilon = 0.05$ (purple dashed line), $\sigma = 0.15$ and a
normalized learning rate ($\tilde{\eta} = \eta a$, see Eq. (12)) of 0.3. For display
purposes, only one in four trials is displayed. B. The performance (blue),
i.e., the probability that $E_\theta < \epsilon$, and the noiseless performance (red), i.e.,
the probability that $E_0 < \epsilon$, are plotted against the normalized learning
rate. These quantities were estimated from simulations of $10^7$ trials,
while excluding the transient learning phase. Note that for $\tilde{\eta} < 1$ the
noiseless performance is perfect. The standard error of the mean is too
small to notice. C. Distribution of the noiseless error, $E_0$, at the end of
the learning phase. For $\tilde{\eta} = 0.3$, the support of the distribution is
bounded by $\epsilon$. For $\tilde{\eta} = 1$, the distribution is uniform for $E_0 < \epsilon$ and zero
otherwise. For $\tilde{\eta} = 1.5$ the support of the distribution is bounded but
extends beyond $\epsilon$. In B and C: $\epsilon = 0.1$; $\sigma = 0.2$.
doi:10.1371/journal.pcbi.1003377.g003



To analyze how $\tilde{\eta}$ affects the learning of the task it is convenient
to decompose the error at trial $t$, $E_\theta(t)$ (Eq. (5)), into:

$$E_\theta = E_0 + 2\,\xi^T \vec{E}_0 + \|\xi\|^2$$

where $E_0 = \|\vec{E}_0\|^2$ (Eq. (6)) on trial $t$ does not depend on the noise,
$\xi(t)$ (for more details, see Materials and Methods). We therefore
refer to $E_0$ as the noiseless error. Changes in the noiseless error are
due to updates in the connectivity matrix, $W$, and only occur after
rewarded trials. In particular, the noiseless error rarely changes at
the beginning of learning, when the probability of getting a reward
is low (Figure 3A). The two other terms depend on the noise at
trial $t$.

We also define the noiseless performance after learning as the
probability that the noiseless error will be smaller than the target
size at large time. In Figure 3A, the noiseless performance
corresponds to the number of trials (red circles) that fall below
target size, divided by the number of trials (see also Materials and
Methods), when the number of trials is large.



Figure 3B plots the performance (blue) and the noiseless
performance (red) for $\epsilon = 0.1$ and $\sigma = 0.2$ vs. the normalized learning
rate, $\tilde{\eta} = \eta a$ (where $a$ is a constant; see Materials and Methods).
The noiseless performance is perfect for $\tilde{\eta} < 1$. It quickly
deteriorates when $\tilde{\eta}$ increases beyond 1, until it becomes extremely
small around $\tilde{\eta} = 2$. Performance decreases monotonically with $\tilde{\eta}$
until it reaches 0 around $\tilde{\eta} = 2$. Qualitatively similar results were
obtained for other values of $\epsilon$ and $\sigma$ (results not shown).

To better understand how the noiseless performance changes
with $\tilde{\eta}$, we solved the learning dynamics analytically in the limit of small target
size ($\epsilon \to 0$). In this limit, the time between rewarded
trials diverges. Using the fact that when a trial $t$ is rewarded, the
noise, $\xi(t)$, is uniquely determined in this limit, we computed the
trajectory of the noiseless error analytically as a function of the
number of rewarded trials; see Materials and Methods. In
particular, the noiseless error goes to zero for a large number of
trials if $\tilde{\eta}$ is smaller than 2 and diverges for $\tilde{\eta}$ larger than 2.

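When $\epsilon \neq 0$ the noiseless error continues to fluctuate with time
(as in Figure 3A) in the range $(0, E_{\max})$, where $E_{\max}$ depends on $\epsilon$
and $\tilde{\eta}$. This maximal value can be calculated analytically as shown
in Materials and Methods:

$$
E_{\max} =
\begin{cases}
\epsilon, & \tilde{\eta} < 1 \\
\dfrac{\epsilon}{(2/\tilde{\eta} - 1)^2}, & 1 < \tilde{\eta} < 2 \\
\infty, & \tilde{\eta} > 2
\end{cases}
$$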



The dependency of noiseless performance on $\tilde{\eta}$ (Figure 3B) stems
from this result. When $\tilde{\eta} < 1$ the noiseless error is always smaller
than the target size (see example in Figure 3C). Therefore the
noiseless performance is always 1. For $\tilde{\eta} = 1$ the distribution of the
noiseless error can be calculated analytically (the proof is beyond
the scope of this paper). It is uniform in the range $(0, \epsilon)$ (blue line in
Figure 3C). For $1 < \tilde{\eta} < 2$ the noiseless error can be larger than the
target size (see example in Figure 3C) and noiseless performance is
no longer perfect. In fact, as $\tilde{\eta}$ increases, the distribution becomes
wider (its SD increases) and noiseless performance decreases
monotonically. Finally, when $\tilde{\eta} > 2$ the above equation predicts
that the support of the noiseless error distribution is unbounded,
and simulations show that it becomes wider; hence the probability
of getting a reward is substantially smaller than for $\tilde{\eta} < 2$.
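
The three regimes can be rationalized by a simple worst-case argument, sketched below as a numerical check. It is not the derivation of Materials and Methods and rests on one assumption: a trial is rewarded only if the noise brings the cursor within a distance equal to the square root of the target size from the target center, so that after a rewarded trial the norm x of the noiseless error vector is at most |1 - η̃| x + η̃ √ε. Iterating this bound and squaring the result gives the largest noiseless error that can be sustained; the function names are hypothetical.

```python
import math


def e_max_closed_form(eta_tilde, eps):
    """Piecewise bound on the noiseless error as a function of the normalized rate."""
    if eta_tilde <= 1.0:
        return eps
    if eta_tilde < 2.0:
        return eps / (2.0 / eta_tilde - 1.0) ** 2
    return math.inf


def e_max_iterated(eta_tilde, eps, n_steps=10_000):
    """Iterate the worst-case map x -> |1 - eta| x + eta sqrt(eps), starting from x = 0."""
    x = 0.0
    for _ in range(n_steps):
        x = abs(1.0 - eta_tilde) * x + eta_tilde * math.sqrt(eps)
    return x * x  # squared norm, to compare with the target size


if __name__ == "__main__":
    for eta_tilde in (0.3, 1.0, 1.5, 1.9):
        print(eta_tilde, e_max_closed_form(eta_tilde, 0.1), e_max_iterated(eta_tilde, 0.1))
```

For normalized learning rates below 2 the map is a contraction and the iteration converges to the closed-form value; at or above 2 it no longer contracts, consistent with an unbounded support.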

While noiseless performance is always perfect for $\tilde{\eta} < 1$,
performance can be improved by taking smaller values of $\tilde{\eta}$
(Figure 3B). This is because the distribution of the noiseless error is
sharper when $\tilde{\eta}$ is smaller. However, decreasing $\tilde{\eta}$ has the obvious
consequence of increasing the learning duration. Figure 2 shows
that for $\tilde{\eta} = 0.3$ it takes only a few dozen trials to adapt perfectly if
the target size is $\epsilon = 0.1$. Nevertheless, for smaller $\epsilon$ the number of
trials increases dramatically. For instance, for $\epsilon = 0.02$ this number
becomes extremely large (much larger than $10^8$) even if $\tilde{\eta} = 0.3$.

Accelerating the adaptation by shaping the task or the 
reward. Shaping is a well-known strategy in the context of 
operant conditioning, which allows a subject to learn difficult tasks 
in a reasonable amount of time [22]. In shaping strategies, the 
difficulty of the task is progressively increased. For a given degree 
of difficulty, the subject has to learn to perform the task, his 
performance is monitored, and when it is considered sufficiently 
satisfactory by the experimentalist, the difficulty of the task is 
increased. A shaping strategy has recently been successfully 
applied to allow subjects to learn the sensorimotor rotation task 
relying solely on a reward signal in the absence of visual feedback 
[17]. In this section, we apply shaping strategies in our model to 
examine to what extent learning can be facilitated or accelerated. 

In the specific case of our sensorimotor adaptation task, the
difficulty of the task depends on the target size, the rotation angle
and the noise level. For fixed noise level and rotation angle,
learning can be shaped by initiating the adaptation process with a
large target size and then reducing the size progressively until it
becomes as small as desired. This can be implemented as follows.
The learning process begins with an initial value of the target size
$\epsilon = \epsilon_0$, which is large enough for adaptation to be easy and fast.
The target size is kept constant, while monitoring the running
average of the reward. When the latter approaches a steady state,
the target size is decreased by $\Delta\epsilon$ (and the running average of the
reward is reinitialized to zero). We repeat this step until the target
size reaches the desired value $\epsilon_d$. An example of such a shaping
strategy is depicted in Figure 4A. Here we plot the learning curve
for $\epsilon_d = 0.02$, when the adaptation is performed in the presence of
very small noise ($\sigma = 0.05$), starting with $\epsilon_0 = 0.2$. Within fewer
than 200 trials the network has adapted and reached a
performance of $0.893 \pm 0.001$. In fact, if the adaptation had been
performed with a fixed value of $\epsilon = \epsilon_d = 0.02$, the probability of
getting the first reward in fewer than $10^8$ trials would essentially be
zero ($\Pr(T_L < 10^8) = 10^{-8}$), making the network unable to adapt
without a tremendous number of trials.
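
The shaping procedure described above can be sketched as follows. This is a simplified illustration under several assumptions, not the protocol used for Figure 4A. First, for a single target the dynamics are reduced to the two-dimensional noiseless error vector, which a rewarded trial shifts by the normalized learning rate times the noise (the rotated noise has the same isotropic distribution as the noise itself). Second, the steady-state test on the running average is replaced by a plainly different criterion: the target shrinks whenever the reward rate in a block of trials exceeds a threshold. The block length, the threshold, the value of `eta_tilde` and the function name are hypothetical, so the trial counts will not match those quoted in the text.

```python
import math
import random


def shape_target_size(eps0=0.2, eps_d=0.02, d_eps=0.018, sigma=0.05,
                      eta_tilde=0.3, gamma_deg=30.0, block=20,
                      rate_threshold=0.6, max_trials=100_000, seed=1):
    """Shrink the target from eps0 to eps_d while a reduced one-target model adapts."""
    rng = random.Random(seed)
    # Noiseless error vector right after the rotation is switched on
    # (target assumed at unit distance along the x-axis).
    g = math.radians(gamma_deg)
    e = [math.cos(g) - 1.0, math.sin(g)]
    eps, trials, rewards_in_block = eps0, 0, 0
    while trials < max_trials:
        xi = (rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        trials += 1
        if (e[0] + xi[0]) ** 2 + (e[1] + xi[1]) ** 2 < eps:  # binary reward
            rewards_in_block += 1
            e[0] += eta_tilde * xi[0]  # reward-gated update of the noiseless error
            e[1] += eta_tilde * xi[1]
        if trials % block == 0:
            rate = rewards_in_block / block
            rewards_in_block = 0  # reinitialize the reward average
            if rate >= rate_threshold:
                if eps <= eps_d:
                    break  # desired target size reached and performance is adequate
                eps = max(eps_d, eps - d_eps)
    return {"trials": trials, "eps": eps,
            "noiseless_error": e[0] ** 2 + e[1] ** 2}
```

A call such as `shape_target_size()` returns the number of trials used, the final target size and the final noiseless error; changing `seed` gives other realizations of the noise.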

Another example of acceleration by shaping is depicted in
Figure 4B. Here, as in Figure 4A, the network has to adapt to a
rotation of 30°. We used parameters similar to those in [17] ($\epsilon = 0.0027$,
corresponding to a target with a 3° radius, and $\sigma = 0.05$).
Adaptation is performed using a constant target size, but at the
beginning of the learning the angle of the rotation is small and is
progressively increased in steps of $\Delta\gamma = 4.2^\circ$ every block of 25
trials. The figure shows that the network adapts in fewer than 200
trials. However, in some of the realizations the network was unable
to follow the gradual rotation (see inset). To avoid such cases, one
can take smaller rotation steps over longer blocks of trials, as in [17].
Another possibility is to monitor the running average reward and
to change the rotation angle when the latter approaches a steady
state, similarly to what we did with the adaptive target size above.

Binary rewards, as typically used in operant conditioning,
provide the subject with a limited amount of information about his
performance. For instance, in our model, a binary reward does not
convey any information regarding the exact distance between the
cursor and the center of the target, either in the case of a miss or in the case
of a success. One way to accelerate adaptation is to shape the
reward, i.e., to perform the learning using a reward that depends
smoothly on the error. One possibility is to use a deterministic
reward given by

$$R = \frac{1}{1 + e^{(E_\theta - \epsilon)/T}} \qquad (1)$$

where $T$ is a smoothing parameter. Figure 5A plots the learning curves
for $T = 0.01$ (top panel) and $T = 0.05$ (bottom panel), for fixed
values of target size and noise level ($\epsilon = 0.05$, $\sigma = 0.1$). The network
improves substantially faster in the latter case than in the former.
However, after the error has stabilized, it is comparable in both
cases. Figure 5B plots the average logarithm of the learning
duration as a function of $T$. It shows that the learning duration
increases rapidly for $T \to 0$, the limit where the reward becomes
binary. Note that the learning duration varies non-monotonically
with $T$ (it is minimal at $T = 0.05$). This is because the learning
duration also increases for large $T$, since a reward which is overly
smoothed is less informative.

Remarkably, performance remains very close to 0.8 up to
$T = 0.05$. Therefore, using a smooth reward with $T = 0.05$
reduces the learning duration substantially without affecting the
performance of the network. For $T$ above 0.05 performance drops
rapidly and the learning duration becomes larger. Hence, in this
case $T \approx 0.05$ is optimal. We found similar behavior for other
values of noise level and target size (not shown).

Figure 4. Shaping the task allows the network to adapt to a
large rotation angle (here $\gamma = 30^\circ$) even if the target size and the
noise level are extremely small. A. Shaping by decreasing the
target size, as explained in the text. Parameters: $\epsilon_0 = 0.2$; $\epsilon_d = 0.02$;
$\Delta\epsilon = 0.018$; $\sigma = 0.05$. Blue: The error is sampled every 3 trials (dots) and
smoothed with a 50-trial median sliding window (line) vs. the number
of trials. Purple: The size of the target. B. Reach angle (in degrees) as a
function of the trial number when the rotation angle is progressively
increased (see Results). The target size is fixed: $\epsilon = \epsilon_d = 0.0027$. At $t = 0$,
$\gamma = 0^\circ$. The rotation angle is increased by 4.2° every 25 trials up to
$\gamma = -30^\circ$. The shaded area corresponds to the target size (±3° around
the target center). Inset: the network is unable to follow the gradual
rotation for a different realization of the noise with the same
parameters. In both panels: $\sigma = 0.05$.
doi:10.1371/journal.pcbi.1003377.g004

Another way to provide more information to the subject on his 
performance, still relying on a binary reward as traditionally used 
in operant conditioning, is to deliver it stochastically with a 
probability decreasing smoothly with the error. This can also 
accelerate adaptation as shown in Figure 5B (dashed lines). Here 
the reward is a Bernoulli random variable with a parameter 

1/(1 +£T^). 

Altogether, our modeling study predicts that reward shaping
strategies, e.g., providing a reward that is a smooth function of the
error, as well as other shaping strategies, should be efficient in
enabling or accelerating such reward-driven sensorimotor
adaptation.

Figure 5. Shaping the reward function accelerates adaptation without impairing performance. A. The reward is given by
R = 1/(1 + e^{(E_θ − ε)/T}). Top: learning curve for a reward function that changes abruptly around the target size (T = 10^-2). Bottom, main panel: learning curve
for a gradual reward function (T = 5 × 10^-2). Note the change in the abscissa scale. Inset: The reward function vs. the error. The target size is indicated by the dashed
purple line. B. The learning duration and the performance vs. the smoothing parameter, T. Solid lines: Deterministic smooth reward function as in A.
Dashed lines: Stochastic binary reward delivered with a probability that depends on E_θ (see Results). In A and B: ε = 0.05; σ = 0.1.
doi:10.1371/journal.pcbi.1003377.g005

Generalization error. How does the network generalize the
rotation for movements toward targets that were not presented
during the adaptation process? To investigate this question we
computed the generalization error, G.E. (see Materials and
Methods), as a function of the angular distance, Δθ, between the
target to which the network had adapted and a test target to which
it did not adapt. For small target size G.E. can be calculated
analytically (Eq. (20)). Figure 6A plots the results for different
widths of the tuning curves, p. For narrow tuning curves
(dashed-dotted line), G.E. is almost one (i.e., perfect generalization) only
when the learned and the test targets are very close. When they are
far apart, G.E. is almost zero. This is because the ability to
generalize depends on the overlap, α(Δθ) (see Eq. (21)), between
the activity profiles in the input layer of the network upon
presentation of the learned and test targets. When the tuning
curves are narrow, α(Δθ) is substantially different from zero only
for very close targets, and when they are far apart it is essentially zero.
The range of angular distances over which the generalization
error is positive becomes broader when p increases (solid line).
However, for wide tuning curves, G.E. is negative when the
targets are far apart. This means that the network performance on
far targets deteriorates compared to what it was before adaptation.
Note that for intermediate values of p the generalization error can
vary non-monotonically with Δθ (dashed lines).
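
The dependence of the overlap on the tuning width can be illustrated numerically. The sketch below is written under stated assumptions and is not the definition of Eq. (21), which is not reproduced in this section: the tuning curves are taken to be Von Mises-shaped with width parameter p, and the overlap is taken to be the normalized dot product between the input activity profiles evoked by the learned and the test targets. The names `tuning_curve` and `overlap` are hypothetical.

```python
import numpy as np

def tuning_curve(theta, preferred, p):
    """Von Mises-shaped tuning with width parameter p (assumed parameterization)."""
    return np.exp((np.cos(theta - preferred) - 1.0) / p)

def overlap(delta_theta_deg, p, n_neurons=360):
    """Normalized dot product of the two input activity profiles (assumed definition)."""
    preferred = np.linspace(0.0, 2.0 * np.pi, n_neurons, endpoint=False)
    x_learned = tuning_curve(0.0, preferred, p)
    x_test = tuning_curve(np.deg2rad(delta_theta_deg), preferred, p)
    return float(x_learned @ x_test / (x_learned @ x_learned))

# Compare a narrow (p = 0.1) and a wide (p = 1) tuning width.
for p in (0.1, 1.0):
    values = [overlap(d, p) for d in (0, 30, 90, 180)]
    print(f"p = {p}:", np.round(values, 4))
```

With these assumptions the overlap is 1 for identical targets, vanishes quickly with distance for narrow tuning, and stays well above zero even for opposite targets when the tuning is wide.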

The generalization error described here reveals possible 
interactions between the learning processes for two distinct targets, 
since adapting for a rotation in one target modifies performance 
toward others. In what follows, we evaluate the impact of such 
interactions when adapting the reaching movements to two targets 
simultaneously and dissect the mechanisms underlying on-line 
positive and negative interactions. 

The learning dynamics for two targets 

What are the learning dynamics when the subject has to perform
the task for two targets? How does learning the task for one of the
targets affect learning the other one? We addressed these questions
in numerical simulations, in which two targets were presented at
an angular distance, Δθ, at consecutive times. Similar results were
obtained when the targets were presented in a random order with
equal probability.

Figure 6. The generalization error (G.E.) for a new target
(defined as the test target), presented after the network has
adapted to one target. G.E. is plotted as a function of the angle of
the test target (in degrees) after adaptation to a target in direction θ = 0°. Perfect
generalization is when G.E. = 1. Lines: Analytical result for ε → 0 (see
Eq. (19)). Circles: Simulation results for ε = 0.01. For clarity, the results are
displayed for test targets sampled every 15 degrees. The generalization
error was averaged over 200 realizations of the noise. Shaded area
represents one SD around the averages. Gray line: zero G.E. The
mapping between p and the half-bandwidth is given in Eq. (3). For
instance, p = 0.1 corresponds to a half-bandwidth of ≈20° and p = 1 to ≈65°.
Parameters: σ = 0.14; γ = 30°.
doi:10.1371/journal.pcbi.1003377.g006

Figure 7. Delayed learning for two targets in opposite directions. A. Learning curves plotted against the number of trials for each of the
targets, sampled every 10 trials. For the target that is learned first (resp. second) the curve is plotted in blue (resp. green). Top: σ = 0.1. Middle:
σ = 0.14. Bottom: σ = 0.18. B. Distribution of learning duration for two opposite targets for different noise levels. Solid lines: The probability
density functions of log10 τ_L^(1) (blue) and log10 τ_L^(2) (green) for the two targets, where τ_L^(1) (resp. τ_L^(2)) is the learning duration for the target
that is learned first (resp. second). Dashed lines: Distributions of log10 τ_L^(1) and log10 τ_L^(2) assuming that the learning durations for the two targets are independent random variables.
The distributions were estimated over 1,000 realizations of the noise. Simulations were long enough for the network to eventually adapt to both
targets. Top: σ = 0.12. Bottom: σ = 0.18. C. The average and the SD of the distributions of log10 τ_L^(1) (blue) and log10 τ_L^(2) (green) vs. the noise level. D.
The distribution of the ratio τ_L^(2)/τ_L^(1) for the two noise level values in B.
doi:10.1371/journal.pcbi.1003377.g007

Delayed learning. The top panel of Figure 7A plots an
example of the learning curves when the two targets are presented
in opposite directions and the noise level is σ = 0.1. Since
this noise level and the target size are the same as in the bottom
panel of Figure 2A, one might expect learning to be fast.
Remarkably, this is not the case here. The error for one of
the targets decreases in fewer than 50 trials, beyond which it keeps
fluctuating, most of the time below ε. The corresponding
performance (see Eq. (26)) is 0.835 ± 0.005. This is in contrast to
what happens for the other target, for which the error increases
rapidly and keeps fluctuating for the whole duration of the
simulation (1000 trials) around a mean that is much larger than ε.
Therefore, in this example, the network is able to adapt in a
reasonable amount of time to only one of the targets, in spite of the
symmetry of the task with respect to target identity.

Increasing the noise has a dramatic effect, as shown in
Figure 7A. For σ = 0.14 (middle panel), the network is able to
learn the task for both targets within 600 trials, but learning the
second target is delayed. Throughout this paper we refer to this
effect as delayed learning. When the noise level is increased further
(σ = 0.18), the network adapts almost simultaneously to the two
targets (bottom panel).

This effect of the noise in suppressing delayed learning is
confirmed in Figure 7B, where the statistics of the logarithm of the
learning durations over many realizations of the noise are
depicted. The learning duration for the first (resp. the second)
learned target is denoted by τ_L^(1) (resp. τ_L^(2)). Obviously, the target
for which adaptation occurs first depends on the specific
realization of the noise. The distribution of log10 τ_L^(2) (green) is
shifted to the right with respect to the distribution of log10 τ_L^(1) (blue),
since τ_L^(2) > τ_L^(1) for each realization, by definition. As a consequence of
delayed learning, this shift is larger than would be expected if the
task had been learned independently for the two targets (dashed
lines). For low noise level this shift is even larger (top panel).
Figure 7C shows the averages of the distributions of log10 τ_L^(1) and
log10 τ_L^(2) vs. σ. As was the case for the average of log10 τ_L for a single
target (Figure 2C), these averages increase for low noise levels.
However, the increase is faster for the second target.

Figure 8. Geometric intuition for the destructive and constructive interferences. Following the perturbation, the cursor is rotated with
respect to the output of the network, hence inducing a large noiseless error (black vector in panel 1). The noise in the output layer (green vector in
panel 2) helps the network to explore the 2D environment, until the cursor falls inside one of the targets (panel 2). This trial is rewarded and therefore
the connectivity matrix is updated, affecting the output of the network on subsequent trials. This decreases the noiseless error for the target for which
the trial has been rewarded, as the rotated output of the network is now closer to it (by adding the vector ηξ, panel 3). This update moves the rotated
output away from the target in the opposite direction, since the vector ηα(Δθ)ξ points away from it. This results in an increase in the error, referred to as
destructive interference. The probability of a rewarded trial for this target is now substantially reduced, delaying learning for that target. A similar effect
occurs when the two targets are sufficiently far apart. However, when they are close (panel 5) the interference becomes constructive, since after the
update of the matrix, the rotated output gets closer to both targets. Note that the overlap, α(Δθ), depends on the width of the tuning curves (see
Materials and Methods).
doi:10.1371/journal.pcbi.1003377.g008

The delayed learning effect is also clear in Figure 7D, which
plots the distribution of the ratio τ_L^(2)/τ_L^(1) for the same values of σ
as in Figure 7B. For the highest noise level, in half of the
realizations τ_L^(2)/τ_L^(1) < 2. By contrast, for low noise level, in more
than half of the realizations learning the second target takes at
least 34 times longer than learning the first one. Overall, delayed learning is
reduced when the noise level is increased.
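
The comparison with independent learning can be made concrete with a short sketch. It is written under the assumption that the independence baseline is obtained by drawing two single-target learning durations independently and labeling the shorter one as the first-learned target; the function names and the example durations are hypothetical and are not data from the simulations.

```python
import numpy as np

def independent_baseline(single_target_durations, n_pairs, rng):
    """Draw pairs of independent single-target durations and sort each pair."""
    d = np.asarray(single_target_durations, dtype=float)
    pairs = rng.choice(d, size=(n_pairs, 2), replace=True)
    first = pairs.min(axis=1)    # duration for the target learned first
    second = pairs.max(axis=1)   # duration for the target learned second
    return first, second

def delay_summary(first, second):
    """Median ratio second/first and the mean log10 durations."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    return {
        "median_ratio": float(np.median(second / first)),
        "mean_log10_first": float(np.mean(np.log10(first))),
        "mean_log10_second": float(np.mean(np.log10(second))),
    }

rng = np.random.default_rng(0)
# Hypothetical single-target learning durations (in trials), for illustration only.
first, second = independent_baseline([40, 60, 90, 150, 300], 1000, rng)
print(delay_summary(first, second))
```

Delayed learning then corresponds to a measured ratio distribution that is shifted toward larger values than the one produced by this baseline.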

Destructive and constructive interference. This delayed
learning can be understood with a geometrical argument, as
explained in Figure 8. When the network generates a rewarded
trial for one of the targets, it affects the outcome for the second
target. Hence, when the targets are in opposite directions, and if
the tuning curves are sufficiently broad, this results in an increase
in the error for the second target (see also Figure 7A). In other
words, the learning processes for the two targets interfere
destructively. As a result, the probability of generating a rewarded
trial for the second target is reduced. Note that according to this
argument, if the targets are sufficiently close, the interference
becomes constructive.
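
The geometrical argument can be reproduced with a single update step in a toy setting. The sketch below rests on assumptions that are not taken from the Methods: the weights change only on rewarded trials, by an amount proportional to the outer product of the output noise and the input, normalized by the squared input norm; the two inputs are two-dimensional vectors with a fixed overlap of 0.8, whereas in the model the overlap depends on the angular distance; and the rewarded noise is the one that lands exactly on the first target. All names are hypothetical.

```python
import numpy as np

def rotation(deg):
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def rewarded_update(W, x, xi, reward, eta):
    """Reward-gated update (assumed rule): W changes only if reward is nonzero."""
    return W + eta * reward * np.outer(xi, x) / (x @ x)

def noiseless_error(W, x, target, gamma_deg):
    """Squared distance between the rotated noiseless output and the target."""
    cursor = rotation(gamma_deg) @ (W @ x)
    return float(np.sum((cursor - target) ** 2))

def second_target_error(delta_theta_deg, gamma_deg=30.0, eta=0.5, a=0.5):
    """Error for target 2 before and after one rewarded trial for target 1."""
    t1 = np.array([1.0, 0.0])
    t2 = rotation(delta_theta_deg) @ t1
    x1, x2 = np.array([1.0, a]), np.array([a, 1.0])   # overlap 2a/(1+a^2) = 0.8
    # Initial weights map each input exactly onto its own target.
    W = np.column_stack([t1, t2]) @ np.linalg.inv(np.column_stack([x1, x2]))
    xi = rotation(-gamma_deg) @ t1 - t1               # noise that hits target 1
    before = noiseless_error(W, x2, t2, gamma_deg)
    W_new = rewarded_update(W, x1, xi, reward=1.0, eta=eta)
    after = noiseless_error(W_new, x2, t2, gamma_deg)
    return before, after

for delta in (180, 20):
    before, after = second_target_error(delta)
    print(f"delta = {delta} deg: error {before:.3f} -> {after:.3f}")
```

In this toy setting the error for the opposite target grows after the update, whereas the error for a target 20° away shrinks, in line with the destructive and constructive cases described above.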

To further analyze the interference in adaptation to the two
targets, we considered the correlations between the errors at
consecutive presentations of the targets. For that purpose, we
estimated the time-dependent correlation coefficient, CC(t), of the
errors over different realizations (see Materials and Methods). A
destructive interference corresponds to negative correlations,
whereas a constructive interference corresponds to positive
correlations. Figure 9A shows how the sign and the time course
of the CC change with the angular distance, Δθ. For the first few
trials, usually none of the presentations of the targets is rewarded
and, therefore, the matrix W does not change. Hence, during the
first trials, CC ≈ 0. For a sufficiently large number of trials the
network adapts to the two targets and |CC(t)| reaches some
stationary value.
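
A correlation coefficient of this kind can be estimated as follows. This is a minimal sketch that assumes CC(t) is the Pearson correlation, computed across realizations, between the errors for the two targets at trial t; the precise definition is the one given in Materials and Methods, and the function name is hypothetical. Trials on which one of the errors does not vary across realizations are assigned a correlation of zero.

```python
import numpy as np

def correlation_over_realizations(err1, err2):
    """Pearson correlation across realizations, one value per trial.

    err1, err2: arrays of shape (n_realizations, n_trials) holding the
    errors for the first and the second target.
    """
    e1 = np.asarray(err1, dtype=float)
    e2 = np.asarray(err2, dtype=float)
    d1 = e1 - e1.mean(axis=0)
    d2 = e2 - e2.mean(axis=0)
    cov = (d1 * d2).mean(axis=0)
    denom = d1.std(axis=0) * d2.std(axis=0)
    cc = np.zeros_like(cov)
    # Leave CC at zero where the variance vanishes (e.g., before any reward).
    np.divide(cov, denom, out=cc, where=denom > 0)
    return cc

# Toy example: 3 realizations, 2 trials.
err_a = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
err_b = [[3.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
print(correlation_over_realizations(err_a, err_b))
```

In the toy example the errors are perfectly anti-correlated on the first trial (destructive interference) and uncorrelated by construction on the second.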

The results in Figure 9A show that the temporal profiles of
CC(t) are qualitatively similar for Δθ = 180° and Δθ = 80°, but in
the latter case CC(t) is less negative, indicating a reduction in the
destructive interference. When the angle is decreased further, to
Δθ = 60°, the shape of CC becomes biphasic. In the latter case
the nature of the interference changes during adaptation from
constructive to destructive. Finally, for sufficiently small Δθ, the
interference is always constructive. For the parameters in
Figure 9A, this is already the case when Δθ = 30°.

Figure 9. Destructive and constructive interferences are a function of the model parameters. The correlation coefficient, CC(t),
characterizes the strength and the nature of the interference during learning of the rotation task for two targets. A. CC(t) for different values of the
angular distance between the targets. The interference becomes constructive when Δθ decreases. B. The extremum of CC(t) over t, CC*, plotted
against Δθ for different values of p. Purple: p = 1. Blue: p = 0.4. Green: p = 0.2. The width of the curve was chosen to correspond to the SD of CC*,
estimated by bootstrap. Note the slight non-monotonicity for p = 0.4. Inset: CC(t) for Δθ = 90°, Δθ = 120°, Δθ = 180° for p = 0.4 (same color code as
for the dots on the main figure in this panel). Parameters: ε = 0.1, γ = 40°, σ = 0.14. C-F. CC(t) is plotted for different values of σ (C), ε (D), p (E) and γ
(F). In all these figures, CC(t) was calculated over 1,000 repetitions. The result was low-pass filtered to suppress fast trial-to-trial fluctuations for the
sake of clarity. Consequently, there is a causality artifact around t = 0 and CC ≠ 0, although it should be zero. The standard errors estimated by bootstrap
are small and are not plotted.
doi:10.1371/journal.pcbi.1003377.g009



Figure 9B plots the extremum of CC(t), CC*, against Δθ, for
different widths of the tuning curves. For broad and sharp tuning
curves, CC* varies monotonically with Δθ (Figure 9B, purple and
green lines). For intermediate degrees of tuning (blue line), CC*
can vary non-monotonically with Δθ (see also the inset
in the figure). This reveals that the interference can vary
non-monotonically with the angular distance, depending on the width
of the tuning curves. This non-monotonicity can be grasped from
the geometric intuition in Figure 8. The interference is more
destructive when Δθ is large; however, as Δθ increases, α(Δθ)
becomes smaller, making the interference less effective. A more
rigorous argument is given in Materials and Methods.

Similarly, the interference for fixed Δθ depends on p, as the
overlap, α(Δθ), becomes smaller when p decreases. This is
depicted in Figure 9C, where we plot CC(t) in the case of two
targets in opposite directions, for three values of p. Decreasing the
width of the tuning curves results in smaller values of |CC*|. For
very sharp tuning curves, interferences are minimal and CC(t)
remains very small during the whole learning process. In fact, in
the limit p → 0, the adaptation process to each of the targets is
independent.

Finally, Figure 9D displays CC(t) for three values of noise level.
The same qualitative behavior is observed in all these cases;
however, CC* is less negative and CC(t) recovers faster when the
noise is stronger. This is because increasing the noise decorrelates
the adaptation process for the different targets, thus reducing the
destructive interference. This is in line with the results displayed in
Figure 7.

Destructive interferences are reduced by shaping the task
or the reward. Increasing the target size (Figure 9E), as well as
reducing the rotation angle (Figure 9F), reduces |CC*|, and hence
the destructive interference, when adapting for two targets in
opposite directions. Therefore, we expect that shaping strategies
which gradually manipulate these parameters can help overcome
the delayed learning effect. Figure 10 shows that this is indeed the
case when the target size is changed adaptively during the
adaptation. The running average of the reward signal for each
target was monitored separately and ε was decreased by Δε only
when both running averages reached a steady state. In this case,
the network adapts to both targets quickly and simultaneously.
Similarly, shaping the task by increasing the rotation angle
progressively reduces the destructive interference and accelerates
the learning (data not shown).
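
The adaptive schedule for the target size can be sketched as a simple decision rule. The criterion used below for a steady state (the running average of the reward for every target changes by less than a tolerance between two successive checkpoints) is an assumption made for illustration, as are the function name, the tolerance, and the example running averages; the criterion actually used in the simulations is not specified in this section.

```python
def shape_target_size(eps, delta_eps, eps_final, avg_now, avg_prev, tol=0.01):
    """Return the target size to use for the next block of trials.

    avg_now, avg_prev: per-target running averages of the reward at the
    current and at the previous checkpoint (one entry per target).
    """
    # Assumed steady-state test: no target's average changed by tol or more.
    steady = all(abs(now - prev) < tol for now, prev in zip(avg_now, avg_prev))
    if steady:
        # Shrink the target, but never below the final target size.
        return max(eps_final, eps - delta_eps)
    return eps

# Both targets at steady state: the target shrinks by delta_eps.
print(shape_target_size(0.25, 0.018, 0.02, [0.80, 0.80], [0.80, 0.795]))
# One target is still improving: the target size is kept.
print(shape_target_size(0.25, 0.018, 0.02, [0.80, 0.30], [0.80, 0.10]))
```

Requiring a steady state for every target, rather than for the average over targets, keeps the task easy until the delayed target has also been learned.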

Finally, there is less interference if learning is performed with a
reward which depends smoothly on the error (Eq. (1)). As depicted
in Figure 10B, this suppresses delayed learning.
Increasing the smoothing parameter reduces |CC*|. For instance,
for the parameters of Figure 10B, |CC*| ≈ 0.4 for T = 10^-1,
whereas |CC*| ≈ 0.8 for T = 1.2 × 10^-2. Similar results are found if
the reward is binary but stochastic, with a probability that is a
function of E_θ (not shown).

Learning faster by learning more 

How does the learning duration, i.e., the time to learn the task 
for all the presented targets, vary with the number of targets? We 
simulated the learning of m targets, whose directions were evenly 
distributed between 0° and 360°. We took a small target size
(ε = 0.01), so that up to 36 non-overlapping targets could be
considered (for targets presented on a circle with radius 1).

Figure 11A plots the average time to learn the entire task in
terms of the total number of target presentations for a fixed noise
level and different tuning widths. It shows a non-monotonic
dependence on the number of targets. This contrasts with
the monotonically increasing learning duration when targets are
learned independently with the same noise level and target size
(dashed line).

Figure 10. Shaping the task or the reward reduces the delayed
learning effect. A. Learning curves for two targets in opposite
directions. The task is shaped by reducing the target size. Parameters:
ε_f = 0.02; ε_0 = 0.25; σ = 0.05. The running averages of the reward were
monitored for the two targets separately. When both averages reached
a steady state the target size was decreased by Δε = 0.018. The error
was sampled every 3 trials. B. Adaptation with a smooth reward
function, Eq. (1). Top: T = 1.2 × 10^-2. Middle: T = 6 × 10^-3. Bottom:
T = 10^-1. Parameters: ε = 0.1; σ = 0.1. The error was sampled every 10
trials.
doi:10.1371/journal.pcbi.1003377.g010

Narrow tuning curves. When the tuning curves are narrow
(black and blue curves) and for small values of m, the overlap α(Δθ)
is essentially zero; therefore, there is no interference and the
network adapts independently to the different targets. An example
is depicted in Figure 11B.1 for p = 0.1. In this figure, the noiseless
error for all three targets is plotted against the number of rewarded
trials. Independence is indicated by the fact that abrupt changes in
the noiseless error for one of the targets do not affect the noiseless
error for the other targets. The overlap only becomes significant
when the targets are close enough, resulting in constructive
interference (see also Figure 9A). In fact, when m increases, the
adaptation for close targets interferes constructively, as depicted in
Figure 11B.2 for m = 6. In this example, learning target 1 (see
color coding in the figure) does not affect the learning of targets 3,
4 and 5 within the first 200 rewarded trials. It does, however,
reduce the noiseless error for the closer targets, i.e., 2 and 6. The
constructive interference is also noticeable for the rest of the
targets. This constructive interference between close targets
facilitates adaptation and explains why the learning duration
decreases for larger m, and the overall non-monotonicity of the
learning duration with m.

Wide tuning curves. For wider tuning curves, interferences
are already present for a small number of targets, but they can be
destructive when the targets are far apart. For instance, for p = 0.4
and m = 3, improvements for one target result in an increased
noiseless error, above the initial error, for the other targets
(Figure 11B.3). However, as in this case p is not too large,
adaptation is almost independent with m = 2 (green curve in
Figure 11A). Similar to the case of narrow tuning curves,
constructive interference between close targets emerges when m
is increased. A representative example of adaptation with m = 6
and p = 0.4 is plotted in Figure 11B.4. Learning target 1 reduces
the noiseless error for the two close targets, whereas the error for
the other three targets, which are farther apart, becomes larger
than their initial values. In this case, constructive interference
among the close targets competes with destructive interference
between targets that are far apart.

The drop in the learning duration when increasing m, both for 
wide and narrow tuning curves, implies that learning more targets 
might be faster than learning only a few. For instance, learning 6 
targets for p = 0.4 is six times faster than learning only three of 
them (the 3 that are separated by 120°). 

Adaptation is in the close-to-far order when the tuning
curves are broad. In Figure 11B.4 (p = 0.4) the network
learned the task in a specific close-to-far order: after it had learned
the first target, it learned the two closest targets (separated by
±60°), and then the far targets (separated by ±120° and finally
the 180° target). Therefore, in this case the targets were learned in
an ordered way. In contrast, in the example plotted in Figure 11B.2,
the tuning curves are narrow (p = 0.1) and the learning of the
targets is not ordered. This difference stems from the fact that
broadening the tuning curves increases the amount of both
destructive and constructive interference. As a result, by learning
one target, the error of the closer targets is already reduced,
whereas learning is delayed for the far targets. Increasing p
thereby results in more ordered learning. To better characterize
how the tuning width controls whether adaptation is ordered or
not, we estimated the probability of ordered learning as a function of
p. Figure 11C depicts the results for m = 6. It shows that the
fraction of the realizations for which learning is ordered increases
monotonically with p.

Figure 11. Adaptation to multiple targets. A. Average total number of target presentations required to learn the entire task vs. the number of
presented targets, m. The targets are evenly distributed (between 0° and 360°). Black: p = 0.05. Blue: p = 0.1. Purple: p = 0.3. Green: p = 0.4. Dashed
black line corresponds to learning the targets independently from the p.d.f. of τ_L, which was estimated from adapting to one target. B. Examples of
the noiseless error during the learning, plotted vs. the number of rewarded trials. The target direction is color coded. Dashed gray lines: The initial
noiseless error for γ = 30°. B.1 and B.2 are examples of the noiseless error for narrow tuning curves (p = 0.1) in the case of 3 and 6 targets
respectively. The plateau in the noiseless errors indicates that there is no interference between the targets. B.3 and B.4 are examples of the noiseless
error for wider tuning curves (p = 0.4) in the case of 3 and 6 targets respectively. The increase in the noiseless error above the initial error for some of
the targets is the result of the destructive interference between far targets. C. The fraction of ordered realizations when m = 6 as a function of p.
Chance level is 13.2%. An ordered realization is defined as learning the targets in a close-to-far order, as in the example in B.4. The statistics were
calculated over 500 realizations. For all the results presented in this figure: σ = 0.14.
doi:10.1371/journal.pcbi.1003377.g011



Generalization error for multiple targets. Figure 12A
plots the generalization error after the network has adapted to 2 or
3 targets for p = 1. The generalization error is essentially one for all
tested targets as soon as the network has adapted to three targets
(green line). How does the generalization error depend on m and
p? Figure 12B plots the noiseless performance (see Eq. (25))
averaged over all the test targets (denoted by P_t), for different
values of m and p. For wide tuning curves, as in Figure 12A,
learning the task with only 3 targets is sufficient for almost perfect
performance on all the test targets (blue line, P_t ≈ 1). Therefore,
there is no added value in adapting to more targets as far as
generalization is concerned. However, as explained above, this can
substantially accelerate learning. In fact, for the parameters used
in Figure 12A the average learning duration is about 170 times
shorter for m = 6 than for m = 3. When the tuning curves are
narrower, the network only generalizes perfectly to all directions
for large m (green and black lines in Figure 12B). Nevertheless, here it
is also advantageous for the network to adapt to more targets than
required for perfect generalization, since this can accelerate
adaptation.

Discussion 

We explored the reward-based component in adaptation 
processes in a setting in which a subject has to adapt reaching 
movements to a rotation when the only information available is 
the location of the target and a binary reward signal indicating 
success or failure on a trial [17]. The subject thus has to adapt to
the perturbation by relying solely on the reward. In the framework
of a simplified model of a neural network learning the task, we
investigated the ways in which the adaptation dynamics depend 
on the noise level in the network, the target size, the size of the 
perturbation and the shape of the reward function. The key 
finding is that if the network has to adapt simultaneously to 
several target locations, constructive or destructive interferences 
between the different movements may occur. Such destructive 
interferences may result in a severe slowdown in the adaptation 
process, but this slowdown can be mitigated if the reward 
changes more gradually from a success to a failure around the 
target. 

If the motor variability is not large enough with respect to the 
target size and the amount of perturbation (Figure 2), it takes a 
long time for the network to generate rewarded trials and to 
update its connectivity matrix. This results in slow adaptation and 
may be the reason why adaptation in the absence of visual 
feedback is notoriously difficult for subjects when the rotation 
angle is too large. For example, at the noise level and target size
reported in [17], the probability of generating a rewarded trial in fewer
than 10^8 trials for a rotation of 30° is essentially zero.

The time to adapt also depends on the size of the change in
synaptic strength on each rewarded trial; i.e., on the learning rate
parameter. We showed that perfect adaptation to one target (i.e.,
100% performance in the absence of noise) is possible only when
the (normalized) learning rate is smaller than 1. A high learning
rate leads to decreased performance and eventually fully impedes
adaptation (Figure 3). Therefore, the extent to which adaptation
can be accelerated by choosing a large learning rate is limited.

Figure 12. Generalization error (G.E.) and performance when adapting to multiple targets. A. The generalization error vs. the location of
the test targets, estimated from simulations as in Figure 6. Shaded area represents one SD around the averages. Tuning width: p = 1. B. The noiseless
performance (see Eq. (25)), averaged over all the tested targets (P_t), is plotted vs. the number of trained targets. See Materials and Methods for details
about how this quantity was estimated. Blue: p = 1. Green: p = 0.1. Black: p = 0.05. Dashed gray: zero G.E. Parameters: ε = 0.01, σ = 0.14.
doi:10.1371/journal.pcbi.1003377.g012



Adaptation is faster for large noise. On the other hand, if the 
noise is too large, final performance is impaired. Interestingly, 
motor areas display high variability at the early stages of learning, 
which becomes smaller afterward. This has been observed in 
reaching tasks in monkeys [39], as well as in song acquisition in 
songbirds [40]. Our study suggests that this change in noise level
during learning may be functionally important for striking a
compromise between fast adaptation and good performance.

We showed that when adapting to multiple targets, learning the 
task for one target can impair performance on other targets due to 
destructive interference. As a result, the probability that the 
network will generate a rewarded trial for these targets decreases. 
Therefore, in this case the same noise level that allows exploration 
of one movement direction is insufficient when adapting to two or 
more targets, resulting in a delayed learning effect. Interestingly,
when the network starts to adapt to the perturbation for the second
target, this does not deteriorate its performance on
the first target, which was already learned. This is because the
network keeps generating rewarded trials for the first target and 
prevents the connectivity matrix from changing in the wrong 
direction for the first target. 

We also showed that there are cases where the interference that 
occurs when multiple targets are presented is constructive. In fact, 
the strength and the nature of the interference depend on the
distance between the targets (the physical stimuli)
and on the overlap of the tuning curves (the neural representations
of the stimuli). Adding more targets creates constructive
interference and can therefore accelerate adaptation.

Generality of the results 

Models of sensorimotor control and learning frequently assume 
minimizing a squared error function. This is convenient because of 
analytical or computational simplicity [13,14]. However, it was 
shown that although these models can be a good approximation 
they tend to penalize large errors excessively [41]. In contrast, we 
chose to explore adaptation with a binary reward function, as used 
in experiments. Our results and predictions stem from the shape of 
the reward function. Specifically, they do not depend qualitatively 
on the specific choice of the distance error used, but are based 
primarily on the fact that the reward function varies sharply with 
the distance to the target center. The dynamics of the adaptation
to more than one target depend on the overlap between the tuning
curves of the input neurons. However, the precise shape of the
tuning curves is not crucial and the results are unchanged if one
replaces the Von Mises function we used here with any other
tuning curve function, such as a cosine tuning curve (see e.g. Eq.
(23)).

As a matter of fact, the results we describe are the outcome of 
the following: 1) the same system is used to learn the task for 
several targets, leading to interference which depends on the way 
in which the targets differ physically as well as in their neuronal 
representation and 2) learning the task for one target can
deteriorate performance on another target such that the
information provided by the reward when attempting to learn the task for
it becomes small, thereby delaying the learning. These two
properties of the learning process are not specific to the simple 
model we investigated here. 

In our model, the latter property stems from the fact that the 
reward varies sharply with the error. The learning rule we used is 
part of a general family of gradient-like reinforcement learning 
rules; i.e., learning rules that on average form a gradient ascent on 
the reward function [35-37]. In fact, learning with an on-line
Gradient Ascent algorithm with a sigmoidal cost function can
result in similar effects (Text S1; Figure S1). It might be claimed
that plasticity also occurs when no reward is delivered [42].
Therefore, we also verified that the phenomenology of the model
remains qualitatively the same when $R \in \{-1, 1\}$ instead of using a
0/1 reward function (unpublished data). Note that to avoid a drift
of the output vector which occurs when $R \in \{-1, 1\}$, the synaptic
weights must be normalized in this case after each trial. Another
extension of our model would be to use a reward prediction error 
instead of an instantaneous reward; e.g., by subtracting a running 
average of the reward from the instantaneous reward. Delayed 
learning also occurs with this type of learning rule (results not 
shown). In fact, previous works have argued that this modification 
does not affect most of the qualitative behavior of the algorithm 
[32,36]. However, it should be noted that in the case of multiple 
targets, computing the running average of the rewards over all 
targets is an additional source of interference, as shown recently in 
[35]. To avoid this, the running average of the reward needs to be
monitored for each target separately. 
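
A minimal sketch of such per-target bookkeeping is given below. The class name, the exponential averaging rate and the target labels are illustrative assumptions and are not taken from the model; the sketch only shows that keeping one running average per target prevents rewards obtained for one target from shifting the baseline of another.

```python
from collections import defaultdict


class PerTargetBaseline:
    """Running average of the reward, tracked separately for each target."""

    def __init__(self, rate=0.1):
        self.rate = rate                  # hypothetical averaging rate
        self.avg = defaultdict(float)     # one running average per target label

    def prediction_error(self, target, reward):
        """Return reward minus this target's running average, then update the average."""
        delta = reward - self.avg[target]
        self.avg[target] += self.rate * delta
        return delta


baseline = PerTargetBaseline(rate=0.5)
trials = [("A", 1.0), ("B", 0.0), ("A", 1.0), ("B", 1.0)]
errors = [baseline.prediction_error(target, reward) for target, reward in trials]
```

With a single shared average, the second trial (target B, no reward) would have produced a negative prediction error caused only by the reward obtained for target A.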

We focused on the learning dynamics in a feed-forward network 
of linear neurons with only two layers. We chose this architecture 
for the sake of simplicity. However, we verified that similar 
qualitative behaviors in terms of interference and delayed learning
occur in a network model in which an intermediate layer
consisting of nonlinear neurons was added, and in which a
decoder provides the angle of reach movement instead of a vector
(Text S1, Figure S2 and unpublished data). Note that in the
framework of this more complex model, the noise can be 
unambiguously related to neuronal variability whereas in the 
simplified two-layer model considered in our paper, the noise is in 
the decoder. 

One limitation of our work is that we did not model the 
trajectory of the movement and/or the muscle activation patterns
needed to produce movements [43]. However, we expect that 
delayed learning and interferences also occur in a more detailed 
model of movement production, such as the one used in 
Legenstein et al. [34]. 

Relation to previous works and predictions 

A reward-based component in a sensorimotor task was shown to 
be involved in adaptation to rotations even when detailed spatial 
information regarding the error was provided to the subject 
[18,19]. We investigated the ways in which possible neural
mechanisms that reinforce successful actions affect adaptation
dynamics. This type of reward-based mechanism was also studied 
in [17]. In this experiment, subjects adapted without visual 
feedback to a gradually increasing rotation of 1° every 40 trials, up 
to an 8° rotation. Our modeling results are in line with these 
experiments (Figure 4B). We thus predict that shaping the reward 
also accelerates adaptation. 

Besides demonstrating that adaptation relying on rewards is 
possible by utilizing a gradual rotation paradigm, the Izawa and 
Shadmehr [17] results suggested that there is no change in the 
perceived sensory consequences of the motor commands; i.e., 
there should be no change in a "forward model". Therefore, in 
[17] adaptation was modeled by an action selection rule. Our 
model is similar to the latter, as we focused on the reward-based 
component during adaptation. However, our model differs in that 
it is value-free, whereas in [17] it involved value-based reinforce- 
ment learning. Nevertheless, our model can also account for the 
experimental results reported in [17] for one target (see Text S1,
Figure S3). Moreover, it allowed us to investigate the generaliza- 
tion curve and possible interference during adaptation for multiple 
targets. 

The learning algorithm. Reward modulated learning rules 
have been used in previous modeling studies of sensorimotor tasks, 
such as birdsong acquisition [10] and motor learning in primates 
[34]. Similar rules have also been implemented in models of
decision making [32,35,44] and association tasks [45]. The reward 
modulated rule we used here is a special case of REINFORCE 
learning rules. As shown by Williams [36], REINFORCE learning 
rules are equivalent on average to a gradient ascent algorithm on 
the average reward function. In fact, the gradient ascent dynamics 
with the average reward function (Eq.(10), averaged over the 
different movement directions) can be computed analytically. 
However, for finite $\eta$ the actual trajectories can deviate
substantially from the gradient ascent trajectory. In particular,
delayed learning and the reduction in learning duration with the
number of targets occur for finite $\eta$, but these phenomena
disappear when $\eta \to 0$ (unpublished data).

Shaping. Shaping strategies are used to teach subjects to 
perform operant conditioning tasks in a reasonable amount of 
time [22]. They were recently applied in the context of 
Reinforcement Learning by either increasing the complexity 
of the task [27,46] or by shaping the reward function [26,27,47]. 
In the context of our model we showed that adaptation to one
target can be accelerated if the target size or the rotation angle
is progressively changed. This also reduces destructive interferences,
thereby accelerating adaptation to multiple targets as well.
We also showed that reward shaping can efficiently suppress
destructive interferences and accelerate adaptation without
compromising on performance.

To the best of our knowledge there are only a few theoretical 
works that have addressed shaping strategies in computational 
models in neuroscience (see e.g. [28]). Fiete et al. [10] used an 
adaptive threshold for reinforcement that adapts to performance. 
This is equivalent to the adaptive target size used here (Figure 4A). 
Smooth reward functions have been used in previous models of 
sensorimotor learning [34,35], but the ways in which the shape of 
the reward function affects learning were not addressed. 

Interference, delayed learning and generalization. The 
delayed learning effect exhibited by our network when it adapts 
to several targets is reminiscent of the slowing down that occurs in 
the model of birdsong learning in Fiete et al. [48]. In that model,
a gradient ascent on a quadratic error function is performed by 
the network to learn a time dependent signal. The slowing down 
is due to destructive interferences in learning different temporal 
chunks of this signal. In fact, the presentation of multiple targets,
one target in each trial, can be considered a discrete
time dependent signal, and interferences when learning multiple 
targets can thus be seen as similar to interferences in different 
temporal chunks of the signal. However, in contrast to Fiete et al. 
[48], our network learns with a stochastic online learning rule, 
rather than a deterministic batch rule, and a different reward 
function is utilized. 

Fiete et al. [48] suggested that to avoid interferences the avian 
brain exploits sparse neural representations. This solution is 
qualitatively similar to narrowing the tuning curves in our model. 
Similarly, Tanaka et al. [13] showed that narrow tuning curves 
can explain the independent learning of multiple targets in the 
context of a visuomotor rotation task with visual feedback. 
However, narrowing the tuning curves is not the only way to 
suppress destructive interferences, in that we showed here that 
they can also be suppressed by increasing the noise level, 
increasing the number of targets, and shaping the task or the 
reward. 

Similarly to previous theoretical works on sensorimotor 
adaptation, we also showed that the shape of the generalization 
curve depends on the width of the tuning curves of the input 
neurons [4,13,14,49]. In [17] it was shown that generalization in a 
reward-based rotation task falls to half of its maximum value 
already at 10° apart from the adapted direction. However, 
generalization above 30° was not explored in this study. We 
therefore did not limit our model to a specific tuning width, as 
further experiments should be conducted to determine the 
generalization in the case of adaptation with rewards. 

Negative generalization has been experimentally observed,
both in adaptation to reaching movements under force-fields [4]
and in visuomotor rotations with visual feedback [14]. In the latter
study, the authors demonstrated that generalization curves are 
task-dependent, and showed how subjects negatively generalize 
the adaptation when targets that are far from the adapted target 
are presented. In fact, this study showed that generalization curves 
can even be non-monotonic. We predict here that this can also 
occur in the case of adaptation without sensory feedback. 

As far as we can ascertain, delayed learning in sensorimotor
adaptation has not been reported before. For delayed learning to 
occur in our model, adaptation to one target needs to impair the 
performance on other targets and the reward must change 
abruptly around the target from a success to a failure. Under the 
assumptions we made, the shape of generalization curves can hint
at on-line interferences that can be expected during adaptation. 
Therefore, because negative generalization was reported in a 
visuomotor adaptation task when the subject receives a continuous 
error [14], one might expect to find on-line interferences as well 
when visual feedback is available. However, in this case the error 
function does not change abruptly with respect to the distance to 
the target, as subjects are aware of the cursor location. Hence, 
when subjects receive visual feedback, we do not expect that 
interferences will result in substantial delayed learning or that 
learning will accelerate when the number of targets is large. We 
verified this expectation in the case of a quadratic error [13]. In 
particular, the learning duration increases monotonically with the 
number of targets and saturates when this number is large (Text
S1; Figure S4).

On the other hand, in the case of adaptation with binary rewards, 
we do expect that if there are angles for which generalization is 
negative, delayed learning will be noticeable, as the reward function 
changes abruptly from a success to a failure (Figure 10). 



Materials and Methods 

The task 

We consider a motor reaching task (see Figure 1A) in which a
subject manually controls the location of a cursor on a screen to
bring it within a circular target of radius $\sqrt{\epsilon}$ [16]. The target
location is characterized by a two-dimensional vector $\mathbf{r}$ of norm 1
(we assume that the target is always at distance 1 from the center of
the screen) and direction $\theta$. In a standard block of trials, the
direction of motion of the cursor and the hand are the same. We
assume that the subject is able to perform the task perfectly in this
case. In a rotation block of trials a perturbation is introduced: the
movement of the cursor on the screen is now rotated by an angle $\gamma$
with respect to the hand movement. To overcome this perturbation
the subject must move his hand in a direction $-\gamma$ with respect
to the target. Here we focus on the case where there is no visual
feedback (the cursor is not on the screen): the only information the
subject receives about his performance is provided by a reward
signal delivered by the experimentalist [17].



Conclusions and perspectives 

The key finding of this theoretical work is that if a reward- 
modulated learning rule underlies adaptation, interferences are 
likely to be observed when learning multiple targets with a binary 
reward. It would be valuable to explore whether such effects 
occur in reward-based sensorimotor adaptation experiments with 
multiple sensory stimuli. We predict that for a binary reward 
function, destructive interferences will be observed if the neurons 
that encode the stimuli have broad tuning curves. These 
interferences are a dynamical counterpart of the generalization 
function and might result in a dramatic slowdown because of the 
abrupt change in the reward from success to failure around target 
size. We also predict that adding more targets should accelerate 
adaptation (Figure 11). From the learning curve of adaptation to
one target, the rate and variability with which subjects adapt can be
estimated. We predict that, for equal variability, subjects with
larger learning rates will tend to display more destructive
interferences and therefore slower adaptation to two targets (see
Eq. (23)). By contrast, if the tuning curves are very narrow,
destructive interferences are unlikely to be found. However, even 
in this case, when the stimuli are sufficiently close, constructive 
interferences should be observed. In this case as well, adding 
more targets should accelerate the adaptation. 

Another prediction is that if adaptation is driven by reward 
modulated plasticity rules similar to the one we used here, 
smoothing the reward function should reduce interferences. In our 
model, this stems from the assumption of a reward modulated 
learning rule and not from the simplifying assumptions we made in 
constructing the model. Therefore, we suggest that testing this 
prediction could shed light on the synaptic mechanisms underlying 
adaptation tasks. 

Finally, the location of the reward-based mechanism involved in 
adaptation could be the cortex-basal-ganglia network. As a matter 
of fact, there is evidence for the involvement of this network in 
pitch shift adaptation in songbirds. Although the neural correlates 
for adaptation in songbirds are unknown, when an auditory 
feedback is available to songbirds (by using miniature headphones 
[7]), the anterior frontal pathway, which is the avian homologue of 
the cortex-basal-ganglia network [50], is essential for adaptation 
based solely on binary rewards [15,16]. Thus, exploring the 
behavioral and neural differences in auditory feedback versus 
binary reward adaptations in pitch shift experiments in songbirds 
may help reveal the neural mechanisms for reward-based 
adaptation. 



The network model 

We consider a simplified computational model of a network
performing this reaching task, see Figure 1B. The input layer of
the network encodes the sensory information regarding the
direction of the target, $\theta$. It is composed of $N$ directionally tuned
neurons labeled by their preferred direction, $\theta_i = 2\pi i/N$ ($i = 1, 2, \ldots, N$).
For simplicity, we assume that the shape of the tuning curves is the
same for all neurons: upon presentation of a target in direction $\theta$
the activity of neuron $i$ is $f(\theta_i - \theta)$. We take:

$$f(\theta_i - \theta) = C \exp\left(\frac{\cos(\theta_i - \theta) - 1}{\rho}\right) \tag{2}$$

where $\rho$ characterizes the width of the tuning curve and $C$ is the
peak response of a neuron. The width of the tuning curves at half
of its maximal activity relative to the baseline (half bandwidth) in
this case is:

$$\theta_{1/2} = \arccos\left(\rho \log\left(\frac{1 + e^{2/\rho}}{2}\right) - 1\right) \tag{3}$$
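
As a numerical illustration of the tuning curve and its half bandwidth, the following Python sketch evaluates a von Mises tuning curve of the form $C \exp((\cos(\theta_i - \theta) - 1)/\rho)$ and checks that, at the half bandwidth, the response lies halfway between the baseline $f(\pi)$ and the peak $f(0)$. The function names and the value $\rho = 0.5$ are illustrative assumptions; the half bandwidth is obtained here from that half-maximum condition.

```python
import numpy as np


def tuning_curve(delta, rho, C=1.0):
    """Response of a neuron whose preferred direction is offset by delta from the target."""
    return C * np.exp((np.cos(delta) - 1.0) / rho)


def half_bandwidth(rho):
    """Offset at which the response is halfway between baseline f(pi) and peak f(0)."""
    # log((1 + exp(2 / rho)) / 2), written with logaddexp to avoid overflow for small rho
    log_term = np.logaddexp(0.0, 2.0 / rho) - np.log(2.0)
    return np.arccos(rho * log_term - 1.0)


rho = 0.5                                   # illustrative width parameter
peak = tuning_curve(0.0, rho)
baseline = tuning_curve(np.pi, rho)
half = tuning_curve(half_bandwidth(rho), rho)
```

For very broad tuning (large $\rho$) the curve approaches a cosine and the half bandwidth approaches 90°.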



The second layer of the network encodes the location of the
endpoint of the hand movement. It consists of two output units
whose activities, $v_1$ and $v_2$, represent the two components of the
hand position, $\mathbf{v}$. Upon presentation of a target in direction $\theta$:

$$\mathbf{v} = W\,\mathbf{F}(\theta) + \boldsymbol{\xi} \tag{4}$$

where $W \in \mathbb{R}^{2 \times N}$ is the connectivity matrix between the two layers,
$\mathbf{F}(\theta)$ denotes the $N$-dimensional vector of the input layer with
components $F_i(\theta) = f(\theta_i - \theta)$, and $\boldsymbol{\xi} \sim \mathcal{N}(0, \sigma^2 I)$ is a Gaussian
noise. The location of the cursor at the end of the movement is
related to $\mathbf{v}$ by a $2 \times 2$ rotation matrix, $D_\gamma$, of angle $\gamma$. Therefore,
the squared distance between the endpoint location of the cursor
and the center of the target is:

$$E_\xi = \|\mathbf{E}_\xi\|^2 = \|\mathbf{r} - D_\gamma \mathbf{v}\|^2 = \|\tilde{\mathbf{r}} - \mathbf{v}\|^2 \tag{5}$$

where $\tilde{\mathbf{r}} = D_\gamma^{-1} \mathbf{r}$. This quantity will be used to measure the error
with which the network performs the reaching task. It is also useful
to define the noiseless error:

$$E_0 = \|\mathbf{E}_0\|^2 = \|\mathbf{r} - D_\gamma \mathbf{y}\|^2 = \|\tilde{\mathbf{r}} - \mathbf{y}\|^2 \tag{6}$$

where $\mathbf{y} = W\mathbf{F}$. This quantity measures the error if the noise is
suppressed.

Upon presentation of a target in a direction $\theta$ at trial $t$, the
network performs the task and a reward $R$ is delivered according
to the outcome:

$$R = \begin{cases} 1 & E_\xi < \epsilon \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
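
A single trial, from the noisy output of the network to the binary reward, can be sketched as follows. This is an illustration under stated assumptions rather than the simulation code used for the paper: the function names are hypothetical, and the toy $2 \times 2$ weight matrix and input vector in the usage lines stand in for the full $2 \times N$ network.

```python
import numpy as np


def rotation(gamma):
    """2 x 2 rotation matrix of angle gamma."""
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[c, -s], [s, c]])


def run_trial(W, F, r, gamma, sigma, eps, rng):
    """Noisy hand endpoint, squared cursor-to-target distance, and binary reward."""
    xi = rng.normal(0.0, sigma, size=2)      # Gaussian noise with covariance sigma^2 I
    v = W @ F + xi                           # hand endpoint
    cursor = rotation(gamma) @ v             # cursor endpoint after the rotation
    error = float(np.sum((r - cursor) ** 2)) # squared distance to the target center
    reward = 1 if error < eps else 0         # rewarded only inside the target
    return xi, error, reward


rng = np.random.default_rng(0)
W = np.eye(2)                # toy weights, for illustration only
F = np.array([1.0, 0.0])     # toy input, chosen so that W @ F = (1, 0)
r = np.array([1.0, 0.0])     # target at direction 0

_, err_plain, rew_plain = run_trial(W, F, r, gamma=0.0, sigma=0.0, eps=0.01, rng=rng)
_, err_rot, rew_rot = run_trial(W, F, r, gamma=np.pi / 2, sigma=0.0, eps=0.01, rng=rng)
```

Without noise and without rotation the toy network lands on the target and is rewarded; with a 90° rotation the cursor misses the target and no reward is delivered.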

The matrix $W$ is then modified according to a reward-modulated
learning rule:

$$W(t) = W(t-1) + \frac{\eta}{N}\, R(t)\, \boldsymbol{\xi}(t)\, \mathbf{F}^T(\theta(t)) \tag{8}$$

where $\eta$ is the learning rate. This learning rule can be derived in a
REINFORCE framework [36].

We assume that at the beginning of learning ($t = 0$), when there
is no rotation, the network is able to perform the reaching task
with zero noiseless error for all targets. When all the Fourier
components of $f(\theta_i - \theta)$ are non-zero, this constraint fully
determines $W_j(0)$:

$$W_j(0) = \frac{1}{N f_1} \begin{pmatrix} \cos(\theta_j) \\ \sin(\theta_j) \end{pmatrix} \tag{9}$$

where $f_1$ is the first Fourier component of the tuning curves. To
get Eq. (9), one needs to calculate the Fourier expansion of $W_j(0)$
by using the constraint:

$$W(0)\,\mathbf{F}(\theta) = \begin{pmatrix} \cos(\theta) \\ \sin(\theta) \end{pmatrix}$$

for each of the $N$ possible target directions, $\theta$. When some of the
Fourier coefficients of the tuning curve function are zero, e.g.
when using cosine tuning curves, $W$ is determined up to the
Fourier coefficients that are irrelevant to the above constraint. This
does not affect the learning dynamics.

Analysis of the model for adaptation to one target

Probability to generate a rewarded trial. The probability
of generating a rewarded trial given the noiseless error at the end
of the previous trial is:

$$P_1(E_0(t)) = \Pr\left(R = 1 \mid E_0(t)\right) = \frac{1}{2\pi\sigma^2} \int_{E_\xi(t) < \epsilon} e^{-\frac{\|\boldsymbol{\xi}\|^2}{2\sigma^2}}\, d\boldsymbol{\xi} = \frac{1}{\sigma^2}\, e^{-\frac{E_0(t)}{2\sigma^2}} \int_0^{\sqrt{\epsilon}} e^{-\frac{r^2}{2\sigma^2}}\, I_0\!\left(\frac{r\sqrt{E_0(t)}}{\sigma^2}\right) r\, dr \tag{10}$$

where $I_n(x)$ is the modified Bessel function of the first kind of order
$n$ [51]. The transition from the second to the third expression is
done by a change of variables to polar coordinates, followed by the
integration over the angle. Using this equation, we can calculate
the probability to get the first reward in a given number of trials
for an initial noiseless error, $E_0(t = 0)$. This probability is given by
a geometric distribution with a parameter $P_1(E_0(0))$ (defined as
$p_1$ for simplicity). When $E_0(0) > \epsilon$, the expectation value of this
distribution, $1/p_1$, diverges for small values of $\epsilon$ and $\sigma$.

Learning dynamics in the limit $\epsilon \to 0$. In the limit $\epsilon \to 0$, the
probability of a trial to be rewarded decreases and thus the
number of trials between rewarded trials diverges (see Eq. 10).
However, one can still characterize the dynamics in terms of the
evolution of the error as a function of the number of rewarded trials.
The condition that the network generates the $k$th rewarded trial
fully determines the noise:

$$\boldsymbol{\xi}(k) = \tilde{\mathbf{r}} - W(k-1)\,\mathbf{F}(\theta) = \mathbf{E}_0(k-1) \tag{11}$$

The connectivity matrix is then updated according to:

$$W(k) = W(k-1) + \frac{\eta}{N}\, \mathbf{E}_0(k-1)\, \mathbf{F}^T(\theta)$$

so that the noiseless error vector obeys:

$$\mathbf{E}_0(k) = (1 - \tilde{\eta})\, \mathbf{E}_0(k-1)$$

where the normalized learning rate is defined by:

$$\tilde{\eta} = \eta\,\alpha \tag{12}$$

with $\alpha = \frac{1}{N}\|\mathbf{F}(\theta)\|^2$. Solving the above recursion, one finds:

$$W(k) = W(0) + \frac{1 - (1 - \tilde{\eta})^k}{N\alpha}\, \mathbf{E}_0(0)\, \mathbf{F}^T(\theta)$$

The error and the squared Frobenius norm of $W$
($\|W\|^2 = \sum_{ij} W_{ij}^2$) are then:

$$E_0(k) = 2(1 - \cos\gamma)(1 - \tilde{\eta})^{2k}$$

$$\|W(k)\|^2 = \|W(0)\|^2 + \frac{2}{N\alpha}(\cos\gamma - 1)\left(1 - (1 - \tilde{\eta})^k\right)(1 - \tilde{\eta})^k$$

where we use the fact that $E_0(0) = 2(1 - \cos\gamma)$.

The sequences $\boldsymbol{\xi}(k)$, $W(k)$ and $E_0(k)$ converge when $k \to \infty$ if:

$$\tilde{\eta} < 2 \tag{13}$$

Their limiting values are then:

$$\boldsymbol{\xi}(\infty) = 0, \qquad W(\infty) = W(0) + \frac{1}{N\alpha}\, \mathbf{E}_0(0)\, \mathbf{F}^T(\theta), \qquad E_0(\infty) = 0, \qquad \|W(\infty)\|^2 = \|W(0)\|^2 \tag{14}$$

Therefore, after enough rewarded trials the noiseless error goes
down to zero. Note that there is no need to normalize the
connectivity matrix after each update in this case, since in the large
$k$ limit the norm of the matrix returns to the value it had at $k = 0$.
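
The small-target dynamics can be checked numerically. The sketch below is an illustration, not the authors' code, and rests on explicit assumptions: the update is scaled as $(\eta/N)\,\boldsymbol{\xi}\,\mathbf{F}^T$, so that the normalized learning rate is $\tilde{\eta} = \eta\alpha$ with $\alpha = \|\mathbf{F}\|^2/N$; the initial weights are built numerically so that $W(0)\mathbf{F}(\theta)$ points at the target; and the parameter values are arbitrary. On every rewarded trial the noise is set equal to the current noiseless error vector, and the simulated error and weight norm are compared with the closed-form expressions $2(1-\cos\gamma)(1-\tilde{\eta})^{2k}$ and $\|W(0)\|^2 + \frac{2}{N\alpha}(\cos\gamma - 1)(1-(1-\tilde{\eta})^k)(1-\tilde{\eta})^k$.

```python
import numpy as np

N, rho, eta = 100, 1.0, 0.3                      # illustrative values
gamma = np.deg2rad(30.0)                         # rotation angle
pref = 2.0 * np.pi * np.arange(1, N + 1) / N     # preferred directions


def population(theta):
    """Input-layer activity for a target at theta (von Mises tuning, C = 1)."""
    return np.exp((np.cos(pref - theta) - 1.0) / rho)


theta = 0.0                                      # trained target direction
F = population(theta)
alpha = F @ F / N
eta_t = eta * alpha                              # normalized learning rate

# Initial weights: columns along (cos, sin) of the preferred directions,
# scaled numerically so that W @ F(theta) equals the unit target vector.
W = np.vstack([np.cos(pref), np.sin(pref)]) / (np.cos(pref) @ population(0.0))
r = np.array([np.cos(theta), np.sin(theta)])
c, s = np.cos(gamma), np.sin(gamma)
r_tilde = np.array([[c, s], [-s, c]]) @ r        # target rotated by -gamma

norm0 = np.sum(W ** 2)
errors, norms = [], []
for _ in range(200):
    E0 = r_tilde - W @ F                         # noiseless error vector
    errors.append(E0 @ E0)
    norms.append(np.sum(W ** 2))
    W = W + (eta / N) * np.outer(E0, F)          # rewarded trial with noise equal to E0

k = np.arange(200)
q = (1.0 - eta_t) ** k
error_formula = 2.0 * (1.0 - np.cos(gamma)) * q ** 2
norm_formula = norm0 + 2.0 / (N * alpha) * (np.cos(gamma) - 1.0) * (1.0 - q) * q
```

The squared norm of the weights dips during learning and returns to its initial value, which is why no normalization is needed in this regime.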

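The support of the noiseless error distribution is
bounded. When $\epsilon$ is finite, the noiseless error after a rewarded
trial is:

$$E_0(k) = \|\mathbf{E}_0(k)\|^2 = \|\mathbf{E}_0(k-1) - \tilde{\eta}\,\boldsymbol{\xi}(k)\|^2 \tag{15}$$

where $\boldsymbol{\xi}(k)$ is such that the constraint in Eq. (7) holds, i.e.,
$\|\mathbf{E}_0(k-1) - \boldsymbol{\xi}(k)\|^2 < \epsilon$. This constraint implies that $\boldsymbol{\xi}(k)$ can be
written as:

$$\boldsymbol{\xi}(k) = \sqrt{E_0(k-1)}\; \hat{\mathbf{e}}_{E_0} + \tilde{\boldsymbol{\xi}}(k) \tag{16}$$

where $\hat{\mathbf{e}}_{E_0}$ is the unit vector in the direction of $\mathbf{E}_0(k-1)$ and $\tilde{\boldsymbol{\xi}}(k)$
is a vector with a maximal norm $\sqrt{\epsilon}$. Inserting Eq. (16) into Eq.
(15) one finds:

$$E_0(k) = E_0(k-1)(1 - \tilde{\eta})^2 + \tilde{\eta}^2 \|\tilde{\boldsymbol{\xi}}(k)\|^2 - 2\tilde{\eta}(1 - \tilde{\eta}) \sqrt{E_0(k-1)}\, \|\tilde{\boldsymbol{\xi}}(k)\|\; \hat{\mathbf{e}}_{E_0}(k-1) \cdot \hat{\mathbf{e}}_{\tilde{\xi}}(k) \tag{17}$$

where $\hat{\mathbf{e}}_{\tilde{\xi}}(k)$ is the unit vector in the direction of $\tilde{\boldsymbol{\xi}}(k)$.

The noiseless error for a large number of trials is a random
variable with a probability $P(E_0)$ on the support $(0, E_{\max})$. For
vector $\mathbf{y}$ to be close to the target, the maximum value of the
noiseless error, $E_{\max}$, needs to be as small as possible. To estimate
$E_{\max}$, we compute the realization of $\tilde{\boldsymbol{\xi}}(k)$ which maximizes the
noiseless error, Eq. (17), at each rewarded trial $k$.

When $\tilde{\eta} < 1$, $E_0(k)$ is maximal if $\hat{\mathbf{e}}_{E_0}(k-1) \cdot \hat{\mathbf{e}}_{\tilde{\xi}}(k) = -1$ and
$\|\tilde{\boldsymbol{\xi}}(k)\| = \sqrt{\epsilon}$. One then gets:

$$\sqrt{E_0(k)} = \sqrt{E_0(k-1)}\,(1 - \tilde{\eta}) + \tilde{\eta}\sqrt{\epsilon}$$

Solving the recursion gives:

$$\sqrt{E_0(k)} = (1 - \tilde{\eta})^k \sqrt{E_0(0)} + \left(1 - (1 - \tilde{\eta})^k\right)\sqrt{\epsilon}$$

and therefore after a long time we get:

$$E_{\max}(\tilde{\eta} < 1) = \epsilon$$

For $\tilde{\eta} > 1$, $E_0(k)$ in Eq. (17) is maximal if $\hat{\mathbf{e}}_{E_0}(k-1) \cdot \hat{\mathbf{e}}_{\tilde{\xi}}(k) = 1$ and
$\|\tilde{\boldsymbol{\xi}}(k)\| = \sqrt{\epsilon}$. This leads to:

$$\sqrt{E_0(k)} = \sqrt{E_0(k-1)}\,(\tilde{\eta} - 1) + \tilde{\eta}\sqrt{\epsilon}$$

Solving the recursion and taking the limit $k \to \infty$, one gets that for
$1 < \tilde{\eta} < 2$:

$$E_{\max}(1 < \tilde{\eta} < 2) = \frac{\epsilon}{(2/\tilde{\eta} - 1)^2}$$

and when $\tilde{\eta} > 2$:

$$E_{\max}(\tilde{\eta} > 2) = \infty$$

To summarize:

$$E_{\max} = \begin{cases} \epsilon & \tilde{\eta} < 1 \\ \dfrac{\epsilon}{(2/\tilde{\eta} - 1)^2} & 1 < \tilde{\eta} < 2 \\ \infty & \tilde{\eta} > 2 \end{cases}$$

In particular, if $\tilde{\eta} < 1$ the noiseless error is guaranteed to always be
smaller than $\epsilon$ at large time.

Generalization error after adaptation to one target. Let
us assume that the network has adapted to the rotation of the
target presented in direction $\theta$. To measure the ability of the
network to generalize to targets in other directions, we calculate
the noiseless error for a test target ($E_{\mathrm{test}}$), presented in a direction
$\theta' \neq \theta$, and define the generalization error by:

$$\mathrm{G.E.} = 1 - E_{\mathrm{test}}/E_0 \tag{18}$$

where $E_0 = 2(1 - \cos\gamma)$ is the noiseless error before adaptation.

In the limit $\epsilon \to 0$ (assuming $\tilde{\eta} < 2$), $E_{\mathrm{test}}$ can be computed
analytically, as a function of $\Delta\theta = \theta - \theta'$. Using Eq. (14) one finds:

$$E_{\mathrm{test}} = 2(1 - \cos\gamma)\left(1 - 2a(\Delta\theta)\cos\Delta\theta + a^2(\Delta\theta)\right) \tag{19}$$

and

$$\mathrm{G.E.} = 2a(\Delta\theta)\cos\Delta\theta - a^2(\Delta\theta) \tag{20}$$

where $a(\Delta\theta)$ is:

$$a(\Delta\theta) = \frac{1}{N\alpha}\, \mathbf{F}^T(\theta)\, \mathbf{F}(\theta')$$

which depends on $\theta$ and $\theta'$ only via $\Delta\theta$. Note that $a(0) = 1$. Specifically,
in the limit of large $N$ and when using the tuning curve function in
Eq. (2), one gets:

$$a(\Delta\theta) = \frac{I_0(x/\rho)}{I_0(2/\rho)} \tag{21}$$

where $x = \sqrt{2(1 + \cos(\Delta\theta))}$.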
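
The overlap function $a(\Delta\theta)$ and the resulting generalization error can be evaluated numerically. The sketch below is illustrative: it assumes von Mises tuning curves with $C = 1$, $N = 100$ evenly spaced preferred directions and an arbitrary width $\rho = 0.5$, and all function names are hypothetical. It compares the overlap computed directly from the population vectors with the large-$N$ Bessel-function expression $I_0(x/\rho)/I_0(2/\rho)$, using NumPy's `np.i0` for $I_0$.

```python
import numpy as np

N, rho = 100, 0.5                                # illustrative values
pref = 2.0 * np.pi * np.arange(1, N + 1) / N     # preferred directions


def population(theta):
    """Input-layer activity for a target at theta (von Mises tuning, C = 1)."""
    return np.exp((np.cos(pref - theta) - 1.0) / rho)


def overlap_direct(dtheta):
    """Normalized overlap of the population vectors of two targets dtheta apart."""
    F, F_other = population(0.0), population(dtheta)
    return (F @ F_other) / (F @ F)


def overlap_bessel(dtheta):
    """Large-N expression I0(x / rho) / I0(2 / rho), with x = sqrt(2 (1 + cos dtheta))."""
    x = np.sqrt(2.0 * (1.0 + np.cos(dtheta)))
    return np.i0(x / rho) / np.i0(2.0 / rho)


def generalization(dtheta):
    """Generalization error 2 a cos(dtheta) - a^2."""
    a = overlap_bessel(dtheta)
    return 2.0 * a * np.cos(dtheta) - a ** 2


dthetas = np.deg2rad(np.arange(0, 181, 15))
direct = np.array([overlap_direct(d) for d in dthetas])
bessel = overlap_bessel(dthetas)
```

For this width, generalization is complete at $\Delta\theta = 0$ and becomes negative for a test target in the opposite direction.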



Adaptation to two targets 

How does a reward affect the next trial? Here we
consider the case where the network adapts to two targets in the
directions $\theta$ and $\theta'$ presented in alternation. If a rewarded trial
occurs for one of the targets, the connectivity matrix is updated,
affecting the noiseless error on the next trial when the other target
is presented.

This noiseless error can be computed in the limit $\epsilon \to 0$. It is a
good estimate for the noiseless error in the beginning of the
adaptation with finite $\epsilon$, where the error is still big with respect to
the target size. Let us assume that on trial $k$ a target in direction $\theta$
is presented and that it is rewarded. This condition fully
determines the realization of the noise on trial $k$, $\boldsymbol{\xi}(k)$. The
noiseless errors for the two targets following the rewarded trial,
denoted $E_0^{\theta}(k)$ and $E_0^{\theta'}(k)$, can be determined analytically. One
finds:

$$\begin{aligned}
E_0^{\theta}(k) &= (1 - \tilde{\eta})^2 E_0^{\theta}(k-1) \\
E_0^{\theta'}(k) &= \left\|\mathbf{E}_0^{\theta'}(k-1) - \tilde{\eta}\, a(\Delta\theta)\, \mathbf{E}_0^{\theta}(k-1)\right\|^2 \\
&= E_0^{\theta'}(k-1) + \sqrt{E_0^{\theta}(k-1)}\; \tilde{\eta}\, a(\Delta\theta) \left(\tilde{\eta}\, a(\Delta\theta) \sqrt{E_0^{\theta}(k-1)} - 2\sqrt{E_0^{\theta'}(k-1)}\cos\Delta\theta\right)
\end{aligned} \tag{22}$$

If $\tilde{\eta} < 2$, $E_0^{\theta}(k) < E_0^{\theta}(k-1)$, that is, the noiseless error for the
target that has been rewarded decreases following the update of
the connectivity matrix. For the other target (direction $\theta'$), the
effect of this update on the noiseless errors depends on the sign of
the expression in parentheses in the second equation. If the two
targets are in opposite directions, it is always positive and
$E_0^{\theta'}(k) > E_0^{\theta'}(k-1)$. Thus, while the network performs better for
one of the targets it performs worse for the other target. We term
this situation destructive interference. On the other hand, if the
targets are close such that the expression in parentheses is
negative, $E_0^{\theta'}(k) < E_0^{\theta'}(k-1)$. In other words, if the network
improves for one of the targets it also improves for the other target.
We term this situation constructive interference.

In particular, for the first rewarded trial, using $E_0^{\theta}(0) = E_0^{\theta'}(0)$,
we get:

$$E_0^{\theta'}(k+1) = E_0^{\theta'}(k)\left(1 - Q(\tilde{\eta}, \rho, \Delta\theta)\right)$$

where:

$$Q(\tilde{\eta}, \rho, \Delta\theta) = 2\tilde{\eta}\, a(\Delta\theta)\cos\Delta\theta - \tilde{\eta}^2 a^2(\Delta\theta) \tag{23}$$

We expect a constructive interference for $Q(\tilde{\eta}, \rho, \Delta\theta) > 0$ and
destructive interferences otherwise. Note that for $\tilde{\eta} = 1$ the
interference function equals the generalization error function
(Eq. (20)). The transition between the constructive and destructive
regimes is given by:

$$\cos(\Delta\theta) = \frac{\tilde{\eta}\, a(\Delta\theta)}{2}$$

The quantity $Q(\tilde{\eta}, \rho, \Delta\theta)$ characterizes the strength of the
interference. It can be a non-monotonic function of $\Delta\theta$. To
show this, we calculate the derivative of $Q(\tilde{\eta}, \rho, \Delta\theta)$ with respect to
$\Delta\theta$, using Eq. (21). This derivative changes sign when:

$$\cos(\Delta\theta) + \rho\, x\, \frac{I_0(x/\rho)}{I_1(x/\rho)} - \tilde{\eta}\, \frac{I_0(x/\rho)}{I_0(2/\rho)} = 0 \tag{24}$$

For instance, when $\tilde{\eta} = 0.3$ and $\rho < 0.73$ the function $Q(\tilde{\eta}, \rho, \Delta\theta)$
depends non-monotonically on $\Delta\theta$. In other words, for sufficiently
narrow tuning curves, the interference varies non-monotonically
with the angular difference. However, this non-monotonicity effect
can be very small: if the tuning curves are too narrow, $Q(\tilde{\eta}, \rho, \Delta\theta)$
quickly reaches zero when increasing $\Delta\theta$.
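
The interference function can be explored numerically. The following sketch is an illustration with hypothetical function names and arbitrary parameter values ($\tilde{\eta} = 0.3$, $\rho = 1$); it assumes the large-$N$ von Mises overlap $a(\Delta\theta) = I_0(x/\rho)/I_0(2/\rho)$ with $x = \sqrt{2(1 + \cos\Delta\theta)}$, evaluates $Q = 2\tilde{\eta}\,a\cos\Delta\theta - \tilde{\eta}^2 a^2$ on a grid of angular separations, and locates the first separation at which the interference turns destructive.

```python
import numpy as np


def overlap(dtheta, rho):
    """Large-N overlap a(dtheta) for von Mises tuning curves."""
    x = np.sqrt(2.0 * (1.0 + np.cos(dtheta)))
    return np.i0(x / rho) / np.i0(2.0 / rho)


def interference(eta_t, rho, dtheta):
    """Q > 0: constructive interference; Q < 0: destructive interference."""
    a = overlap(dtheta, rho)
    return 2.0 * eta_t * a * np.cos(dtheta) - (eta_t * a) ** 2


eta_t, rho = 0.3, 1.0                                 # illustrative values
dthetas = np.deg2rad(np.linspace(0.0, 180.0, 721))    # 0.25 degree steps
Q = interference(eta_t, rho, dthetas)
sign_change = dthetas[np.argmax(Q < 0)]               # first angle with Q < 0
```

At the located angle the condition $\cos(\Delta\theta) = \tilde{\eta}\,a(\Delta\theta)/2$ holds up to the grid resolution.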

Numerical simulations 

In the numerical simulations described in this paper, the input
layer consists of $N = 100$ neurons. We normalized the tuning
curves (parameter $C$ in Eq. (2)) such that $\alpha$ remains constant
($\alpha = 0.36$) when changing $\rho$. This was done to guarantee that the
time to learn one target does not depend on the tuning width.

Learning duration and final error. We define the final
error of the network as the median of the error over the last 1,000
trials of the simulation for each realization. We then determine the
learning duration, $T_L$, as the trial number at which the filtered
signal (median filter with a window length of 50 trials) crosses a
threshold, defined to be 5% above the final error. In order to avoid
boundary problems of the filter at time 0 (the discontinuity in the
error when we induce the rotation), we calculate the error at $t < 0$
while assuming that the cursor is already rotated (even though it
is not). In the figures we plot the actual error before the rotation,
which is small. Similar results were obtained using a linear filter.
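
The procedure can be sketched as follows. This is a simplified illustration, not the analysis code of the paper: the function name is hypothetical, the threshold is taken to be 1.05 times the final error, the median filter is assumed to be causal (it uses the current and the preceding trials), and the window is simply truncated at the start of the series instead of being padded with pre-rotation errors.

```python
import numpy as np


def learning_duration(errors, window=50, tail=1000, margin=0.05):
    """Return (first trial at which the filtered error reaches the threshold, final error)."""
    errors = np.asarray(errors, dtype=float)
    final_error = np.median(errors[-tail:])          # median over the last `tail` trials
    threshold = (1.0 + margin) * final_error         # assumed meaning of "5% above"
    # Causal running median over up to `window` trials (assumed filter alignment)
    filtered = np.array([np.median(errors[max(0, t - window + 1):t + 1])
                         for t in range(len(errors))])
    below = np.nonzero(filtered <= threshold)[0]
    duration = int(below[0]) if below.size else None
    return duration, final_error


# Synthetic learning curve: the error drops abruptly from 1.0 to 0.1 at trial 300
synthetic = np.concatenate([np.ones(300), 0.1 * np.ones(2000)])
duration, final_error = learning_duration(synthetic)
```

With this causal filter the detected duration lags an abrupt drop by about half the window length, since the median switches only once more than half of the window lies after the drop.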

Time dependent correlations between the errors for two
targets. When the network adapts to a rotation for two targets
presented in alternation in consecutive trials, the learning
processes for the two targets interfere. This interference can be
quantified by considering the correlations between the errors on
two consecutive trials:

$$CC(t) = \frac{\left\langle \left(E(t) - \langle E(t) \rangle\right)\left(E(t+1) - \langle E(t+1) \rangle\right) \right\rangle}{\sqrt{\mathrm{Var}[E(t)]\, \mathrm{Var}[E(t+1)]}}$$

The brackets denote the average over repetitions of the adaptation
process, which differ by the realization of the noise, and $\mathrm{Var}$ denotes
the corresponding variance. Negative
$CC(t)$ indicates that if the network improves for one target it
deteriorates for the other target (destructive interference). Positive
$CC(t)$ corresponds to constructive interference.
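
Given the errors from many repetitions of the adaptation process, this correlation can be computed as in the sketch below. The array layout (one row per repetition, one column per trial) and the function name are assumptions made for illustration.

```python
import numpy as np


def consecutive_trial_correlation(errors):
    """errors[n, t] is the error on trial t of repetition n; returns CC(t) for t = 0 .. T-2."""
    errors = np.asarray(errors, dtype=float)
    now, nxt = errors[:, :-1], errors[:, 1:]
    d_now = now - now.mean(axis=0)                   # deviation from the mean over repetitions
    d_nxt = nxt - nxt.mean(axis=0)
    covariance = (d_now * d_nxt).mean(axis=0)
    return covariance / np.sqrt((d_now ** 2).mean(axis=0) * (d_nxt ** 2).mean(axis=0))


# Toy data: three repetitions of two consecutive trials
cc_destructive = consecutive_trial_correlation([[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]])
cc_constructive = consecutive_trial_correlation([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

Because the targets alternate, trials $t$ and $t+1$ always belong to different targets, so the sign of the result separates destructive from constructive interference.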

Performance and noiseless performance. We ran long
simulations of $10^7$ trials to estimate the performance and noiseless
performance after the transient learning phase. The performance
is given by:

$$\left\langle \Theta\left(\epsilon - E_\xi(t)\right) \right\rangle_t \tag{25}$$

and the noiseless performance is given by:

$$\left\langle \Theta\left(\epsilon - E_0(t)\right) \right\rangle_t \tag{26}$$

where $\Theta(x)$ is the Heaviside function and the average is over time,
excluding the transient learning phase.
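
In code, both quantities reduce to the fraction of post-transient trials whose error lies below the target size. The sketch below is illustrative; the function name and the way the transient is specified (a number of initial trials to discard) are assumptions.

```python
import numpy as np


def performance(errors, eps, transient=0):
    """Fraction of trials after the transient with squared error below eps."""
    errors = np.asarray(errors, dtype=float)[transient:]
    return float(np.mean(errors < eps))


# Toy example: the first trial is treated as the transient and discarded
perf = performance([0.5, 0.01, 0.02, 0.2], eps=0.05, transient=1)
```

Passing the noisy errors gives the performance, and passing the noiseless errors gives the noiseless performance.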

Supporting Information 

Figure S1 Delayed learning effect with an on-line gradient
ascent algorithm. A. Delayed learning in a reward function that
varies abruptly with the error ($T = 0.04$). B. The delayed learning
is reduced for a smoother reward function ($T = 0.05$). C. The
delayed learning almost disappears when the reward function is
smoothed even further ($T = 0.067$). Note the change of scale in the
abscissa. Parameters: $\eta = 0.1$, $\epsilon = 0.05$, $\rho = 1$.
(EPS)

Figure S2 Delayed learning effect in a 30° rotation for two
targets in the 3-layer network. The reach angle (in degrees) is
plotted as a function of the trial number. The shaded area
corresponds to the target size. Initial conditions as explained in the
text. Parameters: $\sigma = 0.2$, $\epsilon = 0.1$, $\eta = 0.1$, $\rho = 1$.
(EPS)

Figure S3 Gradual adaptation for an 8° rotation. A. Reach
angle (in degrees) as a function of the trial number when the
rotation angle is increased by 1° every 40 trials up to 8°. The
shaded area corresponds to the target size (±3° around the target
center). $\sigma = 0.06$. B. The generalization error, given as the change
in reach angle. The learned target is at 0°. Circles: simulation
results. For clarity, the results are displayed for test targets sampled
every 2.5 degrees. Solid line: analytical results. Shaded area
corresponds to the standard deviation in generalization error in
numerical simulations estimated over 100 repetitions. Number of
neurons in the input layer: $N = 500$. C. The shape of the tuning
curves that was used in (A) and (B):
$f(\theta_i - \theta) = C\left(a + \exp\left(\frac{\cos(\theta_i - \theta) - 1}{\rho}\right)\right)$,
where $C$ is a normalization constant
(see Materials and Methods), $a = 0.14$, $\rho = 0.005$.
(EPS)

Figure S4 Learning duration when adapting to multiple targets
varies monotonically with the number of learned targets when
using a gradient descent on a quadratic error function. Total
number of target presentations required to learn the entire task vs.
the number of presented targets, $m$. The targets are evenly
distributed (between 0° and 360°). Learning duration was calculated
as the trial number at which the learning curve crossed a threshold of
$15 \cdot 10^{-4}$. Color coded as in Figure 11 in the Results. Black:
$\rho = 0.05$. Blue: $\rho = 0.1$. Purple: $\rho = 0.3$. Green: $\rho = 0.4$. Dashed
black line corresponds to learning the targets independently.
Compare with Figure 11 in main text.
(EPS)

Text S1 This document is a supporting text for the supplementary
figures.
(PDF) 